ABSTRACT

An  abstract  model  is  suggested  to  describe  precisely  systolic 
networks  and  to  verify  their  operation.  The  data  items  appearing  on 
the  communication  links  of  such  networks  at  consecutive  time  units 
are represented by data sequences, and the operations performed by the
network cells are modeled by a system of equations involving operations
on sequences. The input/output relations, which describe the global
effect of the computations performed by the network, are obtained by
solving the corresponding system of sequence equations. This input/output
description can then be used to verify the operation of the
network.

The  model  is  supplemented  with  a  simple  computer  language  that 
may  be  used  to  express  any  system  of  causal  equations  describing  the 
operation  of  a  systolic  network.  An  interpreter  is  developed  for 
this  language  to  solve  such  a  system  for  specific  forms  of  the  inputs 
and  to  produce  the  corresponding  outputs.  The  application  of  this 
interpreter  to  the  computational  assessment  of  a  given  systolic  network 
is  equivalent  to  the  simulation  of  its  execution. 

The abstract model is then applied to the specification and verification
of a systolic machine for the computation of the elemental arrays
in  finite  element  analysis.  Finally,  possible  organizations  for  complete 
finite  element  systems  are  suggested  based  on  the  idea  of  pipelining 
the  computations  associated  with  the  different  elements. 


1.  INTRODUCTION. 


In the past few years, the concept of systolic architectures [33] has become
increasingly important and has been intensively studied for the design of
computer networks that exploit natural parallelism by moving the data regularly
through the network. This type of architecture has two properties desirable
in VLSI implementations, namely, regularity and local interconnections.

Although the concept of systolic architectures is very well developed,
few techniques appear to be known for the formal specification and verification
of such networks. In fact, in most papers on systolic networks, very
little formalism is used, and the reader is usually left with a few diagrams
and experimental evidence. At best, an ad hoc proof technique is
developed for some examples but is not generally usable.

By treating systolic networks as a collection of communicating, parallel
processes, some of the techniques for the verification of distributed systems
(see for example [39]) may be applied to the verification of some correctness
properties of systolic networks [43, 16]. However, this approach does
not make use of the special properties of systolic networks, and hence
gives only rather general results.

In [11], a formal approach for the representation of computational networks
was proposed. This approach was elaborated upon in [26, 27, 53],
where the so-called wave-front notation was used to map algorithmic
descriptions into systolic implementations. Although this notation provides a
powerful tool that can be used in the automatic design of systolic arrays
[54], it does not appear to have the flexibility needed to describe general
systolic networks.


The first part of this dissertation, namely Chapters 2, 3, 4 and 5, concerns
a formal treatment of systolic networks. We start in Chapter 2 by
introducing an abstract model for systolic computations. The model is
based on the representation of the data items appearing on each communication
link as an infinite data sequence. Moreover, the operation of each
computational cell is modeled by a set of equations involving operators on
data sequences. This sequence approach separates the time and the space
dimensions of systolic networks and distinguishes between the network's
function and the specific details of the computations, and thus
leads to a clear and precise specification of systolic computations.

The system of all equations that model the cells in a systolic network
represents an implicit relation between the inputs and the outputs of the
network. By solving this system of equations, we obtain an explicit formula
for the outputs of the network in terms of the inputs. This formula is
called the network I/O description. The output of a specific computation
may be found by substituting the corresponding particular input into the I/O
description. A verification of the computation then requires only a comparison
of the resulting output with the specification of the expected output.
This verification technique is described in Chapter 3, where it is also
applied to different systolic networks.

The applicability of our verification technique depends largely on our
ability to manipulate sequence operators and to solve systems of sequence
equations. The solvability of the equations is discussed in Section 3.4,
where we show that it is always possible to obtain an analytical expression
for the solution of systems of equations resulting from our model of systolic
computations. However, the analytical solution may sometimes be very complicated
and thus not practical. For this reason, we introduce in Chapter 4
a computer solver that may be used to find numerically the solutions of
sequence equations.

Finally, in Chapter 5, we conclude the first part of the dissertation by
discussing some topics that demonstrate the power of the abstract model
and the flexibility of the sequence notation.

It should be noted here that our abstract model carries some properties
of a model called "automaton networks" [19], which in turn is a modification
of the von Neumann cellular arrays [52, 8]. It also carries some properties
of an abstract model [35] used by Leiserson and Saxe to prove that any
synchronous system can be converted into an equivalent systolic system.
Moreover, the objective of the model is similar to that of another model
developed independently by Chen and Mead [10]. Both models separate the
network function from the specific details of a certain computation and allow
for a precise specification and a formal verification of systolic networks.
However, in our model we follow an algebraic approach, while the model in
[10] is oriented toward a procedural approach. More specifically, in [10] a
procedural language is used for the specification of both the network and its
inputs, and the description of the output is obtained by applying fixed point
theory [49] for finding the "least solution" of systems of recursive functions.

The second part of the dissertation consists of Chapters 6, 7, 8
and 9. In this part, we apply the abstract systolic model to the specification
and verification of a special purpose system for finite element analysis.

Very briefly, the finite element method (see e.g. [57]) is a technique
for solving a partial differential equation on a certain domain Q with given
conditions on the boundary of Q. In the case of linear equations, it
involves essentially the following four basic steps: 1) The generation of a
finite element mesh that divides Q into m finite elements. 2) The generation
of elemental stiffness matrices and load vectors for each finite element.
3) The assembly of the global stiffness matrix H and load vector b. 4) The
solution of the resulting linear system of equations Hu=b.

Because of the method's many applications in engineering and mathematics, many
finite element software systems have been developed (see e.g. [44]) and
widely used for the solution of a variety of boundary value problems. However,
the time required to complete a finite element computation on a
serial computer may become extremely large for many realistic, practical
problems. This usually imposes severe limitations on the size and type of
the problems that can be handled, and leads engineers to use lower
degrees of approximation and hence less accurate models. This is especially
true if a design procedure is based on the results of running a finite
element solver repeatedly, with a certain decision to be taken after each
run (interactive design), or if the result of the analysis is to be reported
within certain time limits, as is often the case in military applications.

For this reason, many researchers have considered the use of some
type of parallel processing in finite element analysis. In fact, in a survey
on highly parallel computations [20], Haynes et al. stressed the fact that
one of the most important areas in which parallel computations can be
explored is the solution of partial differential equations, especially
finite element analysis.

The use of array processors for speeding up the finite element computations
was considered by many researchers; for instance, Noor et al.
[41, 40] studied algorithms for performing a finite element dynamic analysis
on the CDC Star-100 computer. Along the same lines, Kamel et al. [48, 36]
studied the usefulness of array processors, combined with mini and super
mini computers, in finite element computations. Both studies showed that
only a limited speed up can be obtained via array processors, especially in
the generation of the stiffness matrices and load vectors.

Similarly, the use of general purpose multiprocessors in the solution of
the linear systems appearing in finite element computations was studied.
For example, in [14, 17], the Cm* multiprocessor was used to solve linear
systems by iterative techniques. The experiments showed that only a limited
number of processors can be used if congestion is to be avoided in interprocess
communications. No studies have been reported in the literature
on the use of a general multiprocessor system to generate the stiffness
matrices or load vectors.

In [47, 55], the problem was partitioned into a large number of separate
processes, and each process was assigned to a processor. This system
also incorporates the use of a posteriori error estimates and a corresponding
refinement of the finite element grid. Although this adaptive approach
appears to be very attractive for parallel processing, it was shown in [56]
that various parallel configurations discussed in the literature are not
expected to give a satisfactory gain in the processing speed, since the times
for communication and data movements between the different processors
dominate the running time.

The most significant attempt in this area is the design of a finite element
machine at the Institute for Computer Applications in Science and
Engineering (ICASE), at the NASA Langley Research Center [29, 28, 45]. In
this project, a 16-bit microprocessor with multiply and divide hardware is
assigned to each node in the finite element grid. Each processor is
directly connected to eight immediate neighbors, and a global bus connects
all the processors in the system. The motivation for this connection is that
most of the data required to complete the calculations at each node come
from the immediate neighbors of that node. Such a machine is planned to
have 1024 processors and is still subject to experimental studies.

Although its idea is attractive, the finite element machine has many
serious drawbacks. First of all, the direct relation between the number of
processors and the number of nodes in the finite element grid imposes
severe limitations on the size of the grid. Secondly, a suitable mapping
has to be found which associates the nodes with the processors. In [4], S.
H. Bokhari showed a possible method for accomplishing this task but stated
that the problem becomes very complex and time consuming for regular
meshes of size larger than 30x30, and is even more complex for irregular
meshes. Finally, in [30] the authors concluded that some additional tree-like
hardware, besides the global bus, is needed to implement global functions
such as the sum and maximum over quantities distributed over the
nodes [5]. According to Jordan [28], the finite element machine is most
suitable if the interconnections between its processors follow the same pattern
as the finite element mesh, which is a very rigid restriction.

By carefully analyzing the different steps in the linear finite element
analysis, we may note that the computations involved are highly regular, and
that they can be divided into separate phases, where each phase depends
only on the preceding phase. Hence the data can be transferred from
phase to phase in a pipelined fashion. The computation within each phase
is also well structured and mostly compute bound, which makes it a very
suitable application for systolic architectures.

After briefly introducing the finite element analysis in Chapter 6, we
apply, in Chapter 7, the abstract model to the specification and verification
of a systolic system for the generation of the elemental arrays. This generation
is often a major time consuming task in workaday finite element
computations.

In Chapters 8 and 9, we describe the architecture of a complete finite
element system. The suggested system is based on the idea of pipelining
the computations associated with the elements rather than processing them
in parallel on an array of processors, as is the case in the machine
described in [29]. This latter machine uses an iterative scheme for the
solution of the linear system of equations resulting from the finite element
formulation. It reduces the processing time considerably by employing a
number of processors proportional to the size of the problem. On the
other hand, the pipeline/systolic approach may be applied with either direct
or iterative solution schemes, and it results in an architecture that is not
dependent on the number of elements in the mesh that covers the domain
of the problem. Basically, it uses a fixed number of processors to complete
the analysis in a time proportional to the size of the problem. Each
approach has its own merits and may be suitable for certain applications.

It should be mentioned, however, that the architecture of the system
that applies a direct solution scheme has the disadvantage of being dependent
on the bandwidth of the global stiffness matrix. Systems that apply
iterative solution schemes do not share this limitation but are relatively
slower. In fact, it turns out that the time for completing each iteration step
is proportional to the number of elements in the mesh.


2.  AN  ABSTRACT  SYSTOLIC  MODEL 


Systolic architectures, pioneered by H. T. Kung, are becoming increasingly
attractive due to continuous advances in VLSI technology. This type of
network architecture has two properties very desirable in VLSI implementations,
namely, regularity and local interconnections.

A  systolic  network  can  be  viewed  as  a  network  composed  of  a  few 
types  of  computational  cells,  regularly  interconnected  via  local  data  links  and 
organized  such  that  streams  of  data  flow  smoothly  within  the  network.  For 
an introduction to systolic architectures, we refer to [33, 31], where further
references  to  specific  examples  are  given. 

As an introductory example, we briefly review a simple systolic network
for the computation of one-dimensional convolution expressions [31]. More
specifically, given a sequence of numbers $\{x_1, x_2, \ldots, x_n\}$ and a sequence
of weights $\{w_1, \ldots, w_k\}$, we want to compute the sequence
$\{y_1, y_2, \ldots, y_{n+1-k}\}$, where each $y_i$ is defined by

$$ y_i = \sum_{j=1}^{k} w_j \, x_{i+j-1} \qquad (2.1) $$
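As a point of reference for the network discussed next, the convolution (2.1) can be evaluated directly on a serial machine. The following is only a minimal sketch; the function name `convolve_1d` and the sample data are ours and do not appear in the cited work.

```python
def convolve_1d(x, w):
    """Return [y_1, ..., y_{n+1-k}] with y_i = sum_{j=1..k} w_j * x_{i+j-1}."""
    n, k = len(x), len(w)
    # Python lists are 0-based, so x_{i+j-1} becomes x[i + j] for i, j >= 0.
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(n + 1 - k)]


if __name__ == "__main__":
    # n = 5 inputs and k = 3 weights give n + 1 - k = 3 outputs.
    print(convolve_1d([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2, -2, -2]
```

The serial evaluation needs on the order of k(n+1-k) multiply/add steps, which is the work that the systolic network distributes over its k cells.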

Figure  2.1  shows  the  building  cell  of  the  1-D  convolution  network  under 
discussion.  It  is  a  multiply/add  cell  with  a  one  word  memory  to  store  a 
real  number  w. 

At each clock pulse, the cell receives two input data items, $x_{in}$ and
$y_{in}$, performs its computation and delivers at the next clock pulse the outputs
$x_o = x_{in}$ and $y_o = y_{in} + w\, x_{in}$. Figure 2.2 shows three such cells
connected into a network that performs the convolution calculation for the
case $k=3$. The elements $x_1, x_2, \ldots, x_n$ are pumped in at the left end of
the network, each separated from the other by one time unit, and zeroes
are pumped in at the right end. To illustrate the operation of the array, we
show in Figure 2.3 the relative location and value of each data item at
times $t = 3, 4, 5$ and $6$, where $t=1$ is the time at which the network started its
execution. By following the data paths, it is easy to ascertain that the output
of the array will include the sequence $\{y_1, y_2, \ldots, y_{n+1-k}\}$.
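The data movement just described can be imitated by a short simulation. The sketch below rests on several assumptions of ours, since Figure 2.2 is not reproduced here: the x items travel to the right and the partial sums travel to the left, each cell delays its outputs by one clock pulse, x_m enters at time 2m-1 with an unused slot in between, the c-th cell from the left stores the weight w_{k-c+1}, and unused slots are represented by `None`. All function names are hypothetical.

```python
DELTA = None  # stand-in for an unused ("don't care") slot; representation assumed here


def at(seq, t):
    """Element at 1-based time t of a finite list; DELTA outside the list."""
    return seq[t - 1] if 1 <= t <= len(seq) else DELTA


def delay(seq):
    """One clock pulse of delay: a DELTA is inserted in front."""
    return [DELTA] + list(seq)


def multiply_add(y_seq, w, x_seq):
    """Element-wise y + w*x, giving DELTA whenever an operand is DELTA."""
    out = []
    for t in range(1, max(len(y_seq), len(x_seq)) + 1):
        y, x = at(y_seq, t), at(x_seq, t)
        out.append(DELTA if y is DELTA or x is DELTA else y + w * x)
    return out


def systolic_convolution(x, w):
    n, k = len(x), len(w)
    horizon = 2 * n + k  # number of simulated clock pulses
    # x_m enters the leftmost cell at time 2m-1; the slots in between are DELTA.
    x_links = [[x[(t - 1) // 2] if t % 2 == 1 and (t - 1) // 2 < n else DELTA
                for t in range(1, horizon + 1)]]
    for _ in range(1, k):
        x_links.append(delay(x_links[-1]))  # x moves one cell to the right per pulse
    y_link = [0] * horizon                  # zeroes pumped in at the right end
    for cell in range(k, 0, -1):            # y moves from the rightmost cell to the left
        y_link = delay(multiply_add(y_link, w[k - cell], x_links[cell - 1]))
    # The non-DELTA items leaving the left end are y_1, ..., y_{n+1-k} in order.
    return [y for y in y_link if y is not DELTA]


if __name__ == "__main__":
    print(systolic_convolution([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2, -2, -2]
```

Under these assumptions the item y_m leaves the left end of the array at time 2m+2k-2, and every other output slot carries no valid data.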

In order to specify and verify formally the operation of any systolic network,
we have to consider both the spatial topology of the network and the
timing of the data movement on its communication links. In this chapter,
we suggest a formal model designed specifically to separate
the space dimension of the systolic architecture from its time dimension.
The separation makes the specification of systolic networks clearer and
leads to a formal technique for the verification of their operations. We
start by considering the mathematical basis of the model.

2.1.  Data  sequences  and  causal  operators. 

We define a data sequence to be an infinite sequence whose elements
are members of the set $R_\delta = R \cup \{\delta\}$, where $R$ is the set of real numbers
and $\delta$ denotes a special element, not belonging to $R$, called the "don't
care element". We extend any operator defined on $R$ to $R_\delta$ in one of the
following two ways:

1) By adding the rule that the result of any operation involving $\delta$ is $\delta$. For
example, we extend the usual arithmetic operations 'op' = '+', '-', '*' or
'/' by adding the following rule:

$$ \delta \ \text{'op'}\ x = x \ \text{'op'}\ \delta = \delta \quad \text{for all } x \in R_\delta $$

This class of operators on $R_\delta$ will be called $\delta$-regular operators.

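2) By treating $\delta$ as a special symbol that affects the result of the operation.
This class will be called non $\delta$-regular operators. For example, we
will consider later the binary operator $\oplus$ such that for any $x, y \in R_\delta$,

$$ x \oplus y = x + y \ \text{ if } x, y \neq \delta; \qquad x \oplus \delta = \delta \oplus x = x \qquad (2.2) $$

Two other non $\delta$-regular operators, that will be used in Section 3.3, are
the operators $\min_\delta$ and $\max_\delta$ defined on an ordered pair $(x, y)$, $x, y \in R_\delta$, by

$$ \min\nolimits_\delta(x, y) = \begin{cases} \min(x, y) & \text{if } x, y \neq \delta \\ y & \text{if } x = \delta \text{ or } y = \delta \end{cases} $$

and

$$ \max\nolimits_\delta(x, y) = \begin{cases} \max(x, y) & \text{if } x, y \neq \delta \\ x & \text{if } x = \delta \text{ or } y = \delta \end{cases} $$

where min() and max() carry the usual meaning on $R$. The reason for distinguishing
between $\delta$-regular and non $\delta$-regular operators will become
clearer as we proceed with the discussion.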
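The two ways of extending an operator can be contrasted in a few lines of code. This is a sketch under our own assumptions: the don't care element is represented by Python's `None`, and the function names are hypothetical. The rules for the ordered pair follow the definitions as stated above.

```python
DELTA = None  # stand-in for the don't care element (an assumption of this sketch)


def regular_add(x, y):
    """Regular extension of '+': the result is DELTA if either operand is DELTA."""
    return DELTA if x is DELTA or y is DELTA else x + y


def oplus(x, y):
    """Non-regular operator of (2.2): DELTA behaves like a neutral element."""
    if x is DELTA:
        return y
    if y is DELTA:
        return x
    return x + y


def min_delta(x, y):
    """Usual min on two real operands; otherwise the second operand is returned."""
    return y if x is DELTA or y is DELTA else min(x, y)


def max_delta(x, y):
    """Usual max on two real operands; otherwise the first operand is returned."""
    return x if x is DELTA or y is DELTA else max(x, y)


if __name__ == "__main__":
    print(regular_add(2, DELTA), oplus(2, DELTA))       # None 2
    print(min_delta(DELTA, 4), max_delta(4, DELTA))     # 4 4
```

The regular extension propagates missing data, whereas the non-regular operators inspect whether an operand is present before deciding what to return.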


Let $N$ be the set of positive integers. Then any data sequence $\eta$ is
defined as a mapping from $N$ to $R_\delta$; that is, the image element $\eta(i)$, $i \in N$,
is the $i$th element in the sequence. The set of all data sequences, that is
the set of all such mappings, will be denoted by $R_\delta^{\infty} = \{\, \eta \mid \eta : N \to R_\delta \,\}$.

Any operation defined on $R_\delta$ is extended to $R_\delta^{\infty}$ by applying the
operation element-wise to the elements of the sequences, with $\delta$ being the
result of any undefined operation. For example, if 'op' is a binary operation
defined on $R_\delta$, then for all $\eta_1, \eta_2 \in R_\delta^{\infty}$ we have $\eta_1$ 'op' $\eta_2 = \eta_3$, where for
all $i \in N$, $\eta_3(i)$ is given by

$$ \eta_3(i) = \begin{cases} \eta_1(i) \ \text{'op'}\ \eta_2(i) & \text{if } \eta_1(i) \ \text{'op'}\ \eta_2(i) \text{ is defined} \\ \delta & \text{otherwise.} \end{cases} $$


We will also use scalar operations on sequences. For example, the
scalar product of a sequence $\eta \in R_\delta^{\infty}$ and a number $w \in R$ is defined as the
sequence $\zeta = w \cdot \eta$ for which $\zeta(i) = w\,\eta(i)$, $i \in N$.
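The element-wise extension and the scalar product may be sketched as follows. We assume, for illustration only, that a sequence is stored as a finite Python list holding its leading elements, with every later position implicitly equal to the don't care element (`None`); the helper names `at`, `lift` and `scale` are ours.

```python
import operator

DELTA = None  # stand-in for the don't care element (an assumption of this sketch)


def at(seq, i):
    """Element i (1-based) of a sequence stored as a finite list; DELTA beyond it."""
    return seq[i - 1] if 1 <= i <= len(seq) else DELTA


def lift(op):
    """Extend a binary scalar operator element-wise to sequences."""
    def seq_op(a, b):
        out = []
        for i in range(1, max(len(a), len(b)) + 1):
            x, y = at(a, i), at(b, i)
            try:
                out.append(DELTA if x is DELTA or y is DELTA else op(x, y))
            except ArithmeticError:  # undefined result, e.g. division by zero
                out.append(DELTA)
        return out
    return seq_op


def scale(w, seq):
    """Scalar product w . eta, leaving don't care elements untouched."""
    return [DELTA if x is DELTA else w * x for x in seq]


if __name__ == "__main__":
    add, div = lift(operator.add), lift(operator.truediv)
    print(add([1, 2, 3], [10, 20]))   # [11, 22, None]
    print(div([1, 2], [0, 4]))        # [None, 0.5]
    print(scale(2, [1, DELTA, 3]))    # [2, None, 6]
```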




Given the previous definition of data sequences, we define the set of
bounded data sequences $\bar{R}_\delta \subset R_\delta^{\infty}$ to contain those sequences having only
a finite number of non-$\delta$ elements. It is then natural to introduce the termination
function $T : \bar{R}_\delta \to N$ such that for any $\eta \in \bar{R}_\delta$, $T(\eta)$ is the position of
the last non-$\delta$ element in $\eta$; in other words:

$$ \text{for any } \eta \in \bar{R}_\delta, \quad T(\eta) = i \iff \eta(i) \neq \delta \ \text{ and } \ \eta(j) = \delta \ \text{ for } j > i. $$

In this dissertation, we will only consider bounded data sequences. These
will be denoted by lowercase Greek letters and simply referred to as sequences.
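For a bounded sequence stored as a finite list, the termination function reduces to a backward scan. The sketch below assumes that `None` stands for the don't care element and that positions past the end of the list are don't care; returning 0 for a sequence with no valid element is a convention of this sketch only, since the definition above does not cover that case.

```python
DELTA = None  # stand-in for the don't care element (an assumption of this sketch)


def termination(seq):
    """Position (1-based) of the last element of seq that is not DELTA."""
    for i in range(len(seq), 0, -1):
        if seq[i - 1] is not DELTA:
            return i
    return 0  # convention of this sketch for a sequence with no valid element


if __name__ == "__main__":
    print(termination([1.0, DELTA, 3.0, DELTA, DELTA]))  # 3
```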

Two special sequences will be repeatedly used, namely the don't care
sequence $\delta^{*}$ and the zero sequence $\iota$. The first is defined by $\delta^{*}(t) = \delta$ for
all $t$, and the second by $\iota(t) = 0$ for $1 \le t \le T(\iota)$ and an arbitrarily large $T(\iota)$.

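In addition to the operators extended from $R_\delta$ to $\bar{R}_\delta$, we may also
define operators directly on $\bar{R}_\delta$. In general, an $n$-ary sequence operator $\Gamma$
is a transformation $\Gamma : [\bar{R}_\delta]^n \to \bar{R}_\delta$, where
$[\bar{R}_\delta]^n = \bar{R}_\delta \times \bar{R}_\delta \times \cdots \times \bar{R}_\delta$ is the cartesian
product space of $n$ copies of $\bar{R}_\delta$. Many operators of this type will be
defined in Section 2.4; we introduce here only a basic unary operator that
will be used frequently in our discussion, namely the shift operator
$\Omega^k$ defined for any $k \ge 1$ by

$$ \Omega^k \xi = \eta, \quad \text{where} \quad \eta(i) = \begin{cases} \delta & \text{if } i \le k \\ \xi(i-k) & \text{if } i > k \end{cases} $$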

More descriptively, $\Omega^k$ inserts $k$ $\delta$-elements at the beginning of a sequence.
For example, if $\xi = a_1, a_2, a_3, a_4, \delta, \delta, \ldots$, that is, $\xi(i) = a_i$ for
$1 \le i \le T(\xi)$, then $T(\xi) = 4$ and

$$ \Omega^3 \xi = \delta, \delta, \delta, a_1, a_2, a_3, a_4, \delta, \delta, \delta, \ldots $$

It is easy to verify that the termination function generally satisfies

$$ T(\Omega^k \xi) = T(\xi) + k $$
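The shift operator and its effect on the termination function can be checked on a small example. As before, this is a sketch under our own assumptions: sequences are finite lists followed implicitly by don't care elements, `None` stands for the don't care element, and the names `shift` and `termination` are hypothetical.

```python
DELTA = None  # stand-in for the don't care element (an assumption of this sketch)


def shift(seq, k=1):
    """Insert k DELTA elements at the beginning of the sequence (k >= 1)."""
    if k < 1:
        raise ValueError("the shift operator is defined for k >= 1 only")
    return [DELTA] * k + list(seq)


def termination(seq):
    """Position (1-based) of the last non-DELTA element, 0 if there is none."""
    for i in range(len(seq), 0, -1):
        if seq[i - 1] is not DELTA:
            return i
    return 0


if __name__ == "__main__":
    xi = ["a1", "a2", "a3", "a4"]
    print(shift(xi, 3))   # [None, None, None, 'a1', 'a2', 'a3', 'a4']
    # The termination position moves by exactly k.
    print(termination(shift(xi, 3)) == termination(xi) + 3)  # True
```

The identity is checked here only for sequences that contain at least one valid element.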

It is also clear that we can define a sequence operator by combining
previously defined sequence operators. For example, we might define an
operator $\Gamma : \bar{R}_\delta \times \bar{R}_\delta \times \bar{R}_\delta \to \bar{R}_\delta$ as follows:

$$ \Gamma(\xi, \eta, \sigma) = \Omega\,[\xi + \eta * \sigma] $$

where square brackets are used for grouping and parentheses for enclosing
the arguments of the operator.

We define a causal operator to be any $n$-ary sequence operator
$\Gamma : [\bar{R}_\delta]^n \to \bar{R}_\delta$ which satisfies the causality property in the sense that the $i$th
element of any of its operands can only affect the $j$th element of its image
for $j > i$. In order to formulate this more precisely, assume that for any
given sequences $\eta_r \in \bar{R}_\delta$, $r = 1, 2, \ldots, n$, the image under $\Gamma$ is
$\xi = \Gamma(\eta_1, \ldots, \eta_r, \ldots, \eta_n)$.
Then $\Gamma$ is a causal operator if the replacement of any operand $\eta_r$, $1 \le r \le n$,
by another sequence $\eta'_r$ satisfying

$$ \eta'_r(t) = \eta_r(t), \qquad 1 \le t < i $$

results in an image sequence $\xi' = \Gamma(\eta_1, \ldots, \eta'_r, \ldots, \eta_n)$ for which

$$ \xi'(t) = \xi(t), \qquad 1 \le t \le i $$

In other words, the value of $\xi(i)$ depends only on the first $i-1$ elements
of $\eta_r$, $1 \le r \le n$.

Similarly, we may define weakly-causal operators for which the $i$th element
of the image sequence, $\xi(i)$, depends only on the first $i$ elements of
the operands $\eta_r$, $1 \le r \le n$, instead of the first $i-1$ elements. With this, it is
easily seen that the combination $\Gamma^1 \Gamma^2$ (or $\Gamma^2 \Gamma^1$) of a causal operator $\Gamma^1$
and a weakly-causal operator $\Gamma^2$ is a causal operator. For instance, the
shift operator $\Omega^k$ is causal and hence the combined operator $\Omega^k \Gamma^2$ is
causal for any weakly-causal operator $\Gamma^2$.
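The causality property lends itself to a simple empirical check: perturb an operand from position i onward and compare the images up to position i. The sketch below does this for unary operators on list-based sequences. It is an assumption-laden illustration with hypothetical names, and a passing check on one pair of sequences is of course not a proof of causality.

```python
DELTA = None  # stand-in for the don't care element (an assumption of this sketch)


def at(seq, t):
    """Element t (1-based) of a list-based sequence; DELTA beyond the list."""
    return seq[t - 1] if 1 <= t <= len(seq) else DELTA


def shift(seq, k=1):
    """The shift operator: k DELTA elements are inserted in front."""
    return [DELTA] * k + list(seq)


def agree_up_to(a, b, i):
    """True if the two sequences coincide at positions 1, ..., i."""
    return all(at(a, t) == at(b, t) for t in range(1, i + 1))


def violates_causality(op, eta, eta_prime, i):
    """True if eta and eta_prime agree for t < i but the images differ for some t <= i."""
    if not agree_up_to(eta, eta_prime, i - 1):
        raise ValueError("the operands must agree at positions 1, ..., i-1")
    return not agree_up_to(op(eta), op(eta_prime), i)


if __name__ == "__main__":
    eta, eta_prime, i = [1, 2, 3, 4], [1, 2, 9, 9], 3
    identity = lambda s: list(s)                       # weakly causal only
    doubled_then_shifted = lambda s: shift([2 * v for v in s])
    print(violates_causality(shift, eta, eta_prime, i))                 # False
    print(violates_causality(identity, eta, eta_prime, i))              # True
    print(violates_causality(doubled_then_shifted, eta, eta_prime, i))  # False
```

The identity operator fails the check because its ith output depends on the ith input, while composing a weakly-causal operator with the shift restores causality, in line with the remark above.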


In the previous discussion, we only considered the space $\bar{R}_\delta$ of
sequences of real numbers. However, other sequence spaces may be
defined as well, by starting with a set different from the set of real
numbers $R$. For example, if we start from the set of boolean truth values
$B = \{\text{true}, \text{false}\}$, then we may define the space of boolean sequences
$B_\delta^{\infty} = \{\, \xi \mid \xi : N \to B \cup \{\delta\} \,\}$.

2.2.  The  abstract  model. 

We begin the specification of the mathematical model used in our verification
technique with the definition of a loop-less, directed multigraph
$G(V, E, \varphi_-, \varphi_+)$ as a structure composed of

(a) a set $V$ of nodes;

(b) a set $E$ of directed edges;

(c) two functions $\varphi_- : E \to V$ and $\varphi_+ : E \to V$, satisfying the condition that for any edge
$e \in E$,

$$ \varphi_-(e) \neq \varphi_+(e) \qquad (2.3) $$

For each edge $e \in E$, the nodes $\varphi_-(e)$ and $\varphi_+(e)$ are the source and
destination node, respectively, of that edge. Clearly, the condition (2.3)
prevents any nodal loops in the graph. This definition of a multigraph
allows any two nodes to be connected by more than one edge in the same
direction, a property that may be useful when we represent systolic networks
by this abstract model.

As usual in graph terminology, for any node $v \in V$, the edges $\{e ;\ \varphi_-(e) = v\}$
directed out of $v$ are termed the OUT edges of $v$, while the edges
$\{e ;\ \varphi_+(e) = v\}$ directed into $v$ are termed the IN edges of $v$. Accordingly, the
IN-degree and OUT-degree of $v$ are the number of IN edges and OUT
edges of $v$, respectively. Any node $v \in V$ with IN-degree zero or OUT-degree
zero is called a source or a sink, respectively. All other nodes are
called interior nodes of $G$. We shall use the notation $V_S$, $V_T$ and $V_I$ for
the subsets of $V$ containing the source, sink and interior nodes of $V$,
respectively. Of course, the condition $V_S \cup V_T \cup V_I = V$ is always satisfied.
With  this  notion  of  a  multigraph,  we  define  our  abstract  systolic  model 
to  be  composed  of  the  following  components. 

[A1] A multigraph $G(V, E, \varphi_-, \varphi_+)$.

[A2] A coloring function $col : E \to C_E$, which maps $E$ into a given finite set of
colors $C_E$, and hence assigns a color to each edge in $E$. The coloring
function is assumed to satisfy the condition that the different IN edges of
any node in $V_T \cup V_I$ have different colors, and correspondingly that the different
OUT edges of any node in $V_S \cup V_I$ have different colors. Edge colors
will be denoted by lower case letters.

[A3] For each edge $e \in E$, a sequence of a given sequence space is
specified.

[A4] For each interior node $v \in V_I$ with IN degree $m$ and OUT degree $n$, we
are given $n$ causal $m$-ary operators $\Gamma_v^i$, $i = 1, \ldots, n$, which specify the "node
I/O description". More specifically, if $\eta_j$, $j = 1, \ldots, m$, and $\xi_i$, $i = 1, \ldots, n$, are
the sequences associated with the IN and OUT edges of $v$, respectively,
then the $n$ relations

$$ \xi_i = \Gamma_v^i(\eta_1, \ldots, \eta_m), \qquad i = 1, \ldots, n $$

are the I/O description of $v$. The different IN and OUT edges of $v$ are
distinguished in the I/O description by their colors.

Since [A2] ensures that all edges terminating at a given node $v$ have
different colors, it follows that any edge $e \in E$ may be uniquely identified by
a pair $(y, v)$, where $y = col(e)$ and $v = \varphi_+(e)$. To simplify the notation, the pair
$(y, v)$ will often be written in the form $y_v$, and the sequence associated with
that edge will be identified by the symbol $\eta_v$, where we replaced the letter
$y$ by its corresponding Greek letter $\eta$.
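The coloring condition of [A2] and the resulting identification of an edge by its color and destination node can be checked mechanically. The sketch below uses hypothetical triples (color, source node, destination node) and a two-cell example of our own; it builds the index from the pair (color, destination) to the edge and rejects colorings that violate the condition.

```python
def index_edges(edges):
    """Map each pair (color, destination) to its edge, enforcing the coloring condition."""
    by_destination = {}
    seen_out = set()
    for color, src, dst in edges:
        if (color, dst) in by_destination:
            raise ValueError(f"two IN edges of node {dst} share the color {color}")
        if (color, src) in seen_out:
            raise ValueError(f"two OUT edges of node {src} share the color {color}")
        by_destination[(color, dst)] = (color, src, dst)
        seen_out.add((color, src))
    return by_destination


if __name__ == "__main__":
    edges = [("x", "sx", "c1"), ("x", "c1", "c2"), ("x", "c2", "tx"),
             ("y", "sy", "c2"), ("y", "c2", "c1"), ("y", "c1", "ty")]
    index = index_edges(edges)
    print(index[("y", "c1")])  # ('y', 'c2', 'c1'): the edge written y with subscript c1
```

Because the key is the pair (color, destination), two edges of the same color may leave different nodes or enter different nodes without ambiguity.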

For practical applications, it is generally desirable to identify the nodes
of the network by appropriate labels which correspond to the problem at
hand. This means that we introduce a set $L$ of labels together with a one-to-one
function from $V$ onto $L$. In our examples, we usually identify
the nodes directly with their labels.

2.3.  The  general  systolic  network. 

By  giving  a  physical  interpretation  to  each  component  in  the  general 
abstract  model  we  obtain  a  general  systolic  network.  The  basic  idea  of  this 
interpretation  may  be  summarized  as  follows: 

Each interior node represents a computational cell and each source/sink
node corresponds to an input/output cell for the overall network. To distinguish
in our figures the computational cells from the I/O cells, we depict
computational cells by circles and I/O cells by squares.

Each edge $x_v \in E$ represents a unidirectional communication link between
the two cells it connects. The sequence associated with $x_v$ then comprises
the data items that appeared on it in consecutive time units. More specifically,
if $\xi_v$ is the sequence associated with $x_v$, then the $i$th element of $\xi_v$,
namely $\xi_v(i)$, is the data item that appeared on $x_v$ at time $t = i$ units, where
$t = 1$ is the time at which the network started its operation.

Clearly, the sequence space from which the sequences are taken
corresponds to the type of data items that may be carried on the communication
links. In this dissertation, we will consider only networks in which
the communication links may carry real numbers. In other words, any
sequence is assumed to be in the space $\bar{R}_\delta$.

For an interior node, the node I/O description describes the computations
performed by the cell corresponding to that node. We illustrate this
with two simple examples:


Figure 2.4 - A delay cell        Figure 2.5 - A multiply/add cell

EX 1: The node shown in Figure 2.4 represents a simple latch cell
which produces at any time $t>1$ on its output link the same data item
that appeared on its input link at time $t-1$. At time $t=1$, we have
$\eta(1)=\delta$, which corresponds to the fact that at the beginning of the
network operation no specific data item appeared on the output link.

EX 2: The operation of the multiply-add cell mentioned in Section 2.1
and shown in Figure 2.1 may be represented by the following node I/O
descriptions:

$$\xi_o = \Omega\,\xi_{in}$$

$$\eta_o = \Omega\,[\,\eta_{in} + w \cdot \xi_{in}\,]$$

where $w \in R$ is a given real number and $\xi_{in}$, $\eta_{in}$, $\xi_o$ and
$\eta_o$ are the input and output sequences of the node as shown in Figure 2.5.

At this point, it may be useful to note that if a $\delta$-regular operator is
used to model a computational cell, then this cell treats $\delta$ as a "don't
know" quantity, and consequently, the result of any operation cannot be
known if any of the operands is not known. On the other hand,
non-$\delta$-regular operators are used to model computational cells which
treat $\delta$ as a special symbol that affects the result of the operation.
Hence, each physical communication link in networks containing cells of this
type should be augmented by an additional wire to indicate whether the link
carries valid data or not. The operation of each cell is then dependent on
this additional piece of information.
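
As a minimal sketch of the two kinds of behavior discussed so far, the following Python fragment models the latch cell of EX 1 and a $\delta$-regular element-wise addition. Representing $\delta$ by `None`, representing a sequence by a finite Python list, and the names `latch` and `regular_add` are assumptions of this sketch, not notation from the model itself.

```python
DELTA = None  # assumed stand-in for the "don't know" element delta

def latch(xi):
    """Latch cell of EX 1: the output at time t is the input at time t-1, and delta at t = 1."""
    return [DELTA] + xi[:-1]

def regular_add(xi, eta):
    """Delta-regular addition: the result is unknown whenever an operand is unknown."""
    return [DELTA if a is DELTA or b is DELTA else a + b
            for a, b in zip(xi, eta)]

delayed = latch([1, 2, 3])
summed = regular_add([1, DELTA, 3], [10, 20, 30])
```

A non-$\delta$-regular operator would instead inspect `DELTA` explicitly, which is the software counterpart of the extra "valid data" wire mentioned above.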

Since in any practical dynamic system any data item produced by a
computational cell at time $t$ depends only on the data provided to that cell
at times less than $t$, we immediately see the importance of the condition
imposed in Section 2.2 on the node I/O descriptions, namely that exclusively
causal operators in the sense of Section 2.1 are to be used. We also
note that with the model described above, the computational power of each
cell is not limited to simple arithmetical operations. In other words, a cell
could be an intelligent cell that can perform elaborate calculations provided
only that we can express these calculations in terms of causal operators.

Clearly, operators on sequences play an important role in the abstract
model. In the next section, we introduce additional sequence operators that
are defined directly on $\bar{R}_\delta$.

2.4.  Additional  operators  on  sequences. 

It was shown in the last section that element-wise operators and the
shift operator may be used to model simple computational cells. However,
these operators are not sufficient to model cells with memory capabilities or
with complex control structures. Here, we introduce new sequence operators
that may be used to express the computation of some elaborate types of
cells. For simplicity, given any operator
$\Gamma : [\bar{R}_\delta]^n \to \bar{R}_\delta$, the notation
$[\Gamma(\xi_1,\ldots,\xi_n)](t)$ will be employed to designate the $t$th
element $\eta(t)$ of the image sequence $\eta = \Gamma(\xi_1,\ldots,\xi_n)$.
This is consistent with the convention of using square brackets for grouping.
We will also use the symbol $\div$ for integer division and the Fortran
function mod() that specifies the remainder of an integer division. We start
by generalizing the definition of the shift operator given in Section 2.1.

The Shift operator $\Omega^r : \bar{R}_\delta \to \bar{R}_\delta$ is defined
for any integer $r$ by

$$[\Omega^r \xi](t) = \begin{cases} \delta & \text{if } r > 0 \text{ and } t \le r \\ \xi(t-r) & \text{otherwise.} \end{cases}$$

Hence, for $r>0$, $\Omega^r$ inserts $r$ $\delta$-elements at the beginning of
a sequence and therefore models the computation of a delay cell. On the other
hand, for $r<0$, $\Omega^r$ trims the first $|r|$ elements of the sequence and
thus is a non-causal operator which cannot be used to model computational
cells. The role of the negative shift operator is to provide in the proofs an
inverse for the positive shift. More precisely, for any sequence $\xi$ and any
$r>0$, we have $\Omega^{-r}\,\Omega^{r}\,\xi = \xi$. The converse is not always
true, in the sense that $\Omega^{r}\,\Omega^{-r}\,\xi = \xi$ only if
$\xi(t)=\delta$ for $t \le r$.
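
The inverse property just stated can be checked on a finite prefix of a sequence. The following Python sketch is illustrative only: it assumes that $\delta$ is represented by `None`, that a sequence is stored as a list whose first entry is the element at time $t=1$, and that anything beyond the stored prefix is unknown. The function name `shift` is hypothetical.

```python
DELTA = None  # assumed stand-in for the delta element

def shift(xi, r):
    """Apply the shift operator Omega^r to the finite prefix xi (xi[0] is time t = 1)."""
    n = len(xi)
    out = []
    for t in range(1, n + 1):
        src = t - r  # time index of the element that lands at time t
        # Outside the stored prefix the value is delta (r > 0) or unknown (r < 0).
        out.append(xi[src - 1] if 1 <= src <= n else DELTA)
    return out

xi = [3.0, 1.0, 4.0, 1.0, 5.0, DELTA, DELTA, DELTA]
delayed = shift(xi, 2)           # two delta-elements are inserted in front
restored = shift(delayed, -2)    # the negative shift trims them again
converse = shift(shift(xi, -2), 2)
```

Here `restored` equals `xi`, whereas `converse` differs from `xi` because the first two elements of `xi` are not $\delta$.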

The Zero Shift operator $\Omega_0^r : \bar{R}_\delta \to \bar{R}_\delta$ has
the same definition as $\Omega^r$ except that $\Omega_0^r$ inserts $r$ zeroes
at the beginning of a sequence instead of $r$ $\delta$-elements. The zero
shift operator is useful in modeling delay cells in networks that initially
set the data on their communication links to zero. In such networks we must
assume that the entries corresponding to the time $t=1$ in any non-input
sequence are equal to 0 rather than $\delta$.

The Accumulator operator $A^{r,k,s} : \bar{R}_\delta \to \bar{R}_\delta$ is
defined to model a cyclic accumulator that starts operation at time $t=r$,
accumulates a new element every $s$ time units and restarts a new cycle every
$sk$ time units. The accumulator operator can be defined in terms of the
following algorithm that computes $[A^{r,k,s}\,\xi](t)$ for any $t>0$, given
the elements $\xi(j)$, $j \le t$:




    IF ($t < r$) THEN $[A^{r,k,s}\,\xi](t) = \delta$          /* accumulator is idle */
    ELSE
    BEGIN
        $t_f = t - \mathrm{mod}((t-r) \div sk)$               /* time of last reset */
        $n_a = ((t - t_f) \div s) + 1$                        /* number of elements accumulated */
        $[A^{r,k,s}\,\xi](t) = \sum_{j=0}^{n_a-1} \xi(t_f + sj)$   /* result of accumulating $n_a$ elem. */
    END


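Evidently, this algorithm is equivalent with

$$[A^{r,k,s}\,\xi](t) = \begin{cases} \delta & t < r \\ \sum_{j=0}^{n_a-1} \xi(t_f + sj) & t \ge r \end{cases}$$

where $n_a$ and $t_f$ are as specified before. As an example, let

$$\xi = a_1, b_1, a_2, b_2, \ldots, a_7, b_7, \delta, \delta, \ldots \tag{2.4}$$

then

$$A^{2,3,2}\,\xi = \delta,\ b_1,\ \bullet,\ b_1+b_2,\ \bullet,\ b_1+b_2+b_3,\ \bullet,\ b_4,\ \bullet,\ b_4+b_5,\ \bullet,\ b_4+b_5+b_6,\ \bullet,\ b_7,\ \bullet,\ \delta, \ldots$$

where $\bullet$ denotes an element that is equal to the preceding one.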
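
The accumulator algorithm can be traced numerically. The sketch below is a hedged illustration rather than the author's implementation: it assumes $\delta$ is represented by `None`, that the idle output for $t<r$ is $\delta$, that addition is $\delta$-regular, and it replaces the symbolic entries of (2.4) by the hypothetical values $a_i = 100+i$ and $b_i = i$.

```python
DELTA = None  # assumed stand-in for the delta element

def add(a, b):
    """Delta-regular addition of two data items."""
    return DELTA if a is DELTA or b is DELTA else a + b

def accumulate(xi, r, k, s):
    """Accumulator A^{r,k,s} on the finite prefix xi (xi[0] is time t = 1)."""
    out = []
    for t in range(1, len(xi) + 1):
        if t < r:
            out.append(DELTA)              # accumulator is idle
            continue
        t_f = t - (t - r) % (s * k)        # time of last reset
        n_a = (t - t_f) // s + 1           # number of elements accumulated
        total = xi[t_f - 1]
        for j in range(1, n_a):
            total = add(total, xi[t_f + s * j - 1])
        out.append(total)
    return out

# Prefix of (2.4) with the assumed values a_i = 100 + i and b_i = i.
xi = []
for i in range(1, 8):
    xi += [100 + i, i]
xi += [DELTA, DELTA]

result = accumulate(xi, r=2, k=3, s=2)
```

With these values the partial sums $b_1$, $b_1+b_2$, $b_1+b_2+b_3$ appear as 1, 3, 6, each held for two time units, and the cycle restarts with $b_4 = 4$ at $t = 8$.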

The Multiplexer operator
$M_r^{w_1,\ldots,w_n} : [\bar{R}_\delta]^n \to \bar{R}_\delta$ is defined to
model a multiplexer that has $n$ inputs $\xi_1,\ldots,\xi_n$. It starts its
operation at time $t=r$ and periodically multiplexes its inputs with a time
ratio of $w_1 : w_2 : \cdots : w_n$. If the length of the multiplexer cycle is
denoted by $k = \sum_{e=1}^{n} w_e$, then the following algorithm defines the
multiplexer operator:

    IF ($t < r$) THEN $[M_r^{w_1,\ldots,w_n}(\xi_1,\ldots,\xi_n)](t) = \delta$    /* multiplexer idle */
    ELSE
    BEGIN
        $t_c = t - \mathrm{mod}((t-r) \div k)$                       /* start of current cycle */
        Find the smallest integer $q$
            such that $(t - t_c) < \sum_{i=1}^{q} w_i$               /* determine interval within cycle */
        $[M_r^{w_1,\ldots,w_n}(\xi_1,\ldots,\xi_n)](t) = \xi_q(t)$   /* choose corresponding input */
    END

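As an example, let

$$\zeta = a_1, a_2, \ldots, a_7, a_8, a_9, \delta, \delta, \ldots$$

and

$$\eta = b_1, b_2, \ldots, b_7, \delta, \delta, \delta, \ldots$$

then

$$M_3^{1,2}(\zeta, \eta) = \delta,\ \delta,\ a_3,\ b_4,\ b_5,\ a_6,\ b_7,\ \delta,\ a_9,\ \delta, \ldots$$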
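
The multiplexer algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions: $\delta$ is represented by `None`, the symbolic items $a_i$, $b_i$ are represented by strings, the interval within a cycle is taken as the first one whose cumulative weight exceeds the offset $t - t_c$, and the example is read with start time $r=3$ and ratio $1:2$. The function name `multiplex` is hypothetical.

```python
DELTA = None  # assumed stand-in for the delta element

def multiplex(r, weights, inputs):
    """Multiplexer M_r^{w_1..w_n} applied to finite prefixes of its n input sequences."""
    k = sum(weights)                        # length of one multiplexer cycle
    horizon = min(len(seq) for seq in inputs)
    out = []
    for t in range(1, horizon + 1):
        if t < r:
            out.append(DELTA)               # multiplexer idle
            continue
        offset = (t - r) % k                # t - t_c, position within the current cycle
        q, bound = 0, weights[0]
        while offset >= bound:              # first interval whose cumulative weight exceeds offset
            q += 1
            bound += weights[q]
        out.append(inputs[q][t - 1])        # choose the corresponding input at time t
    return out

zeta = ["a%d" % i for i in range(1, 10)] + [DELTA]
eta = ["b%d" % i for i in range(1, 8)] + [DELTA] * 3
result = multiplex(3, [1, 2], [zeta, eta])
```

The list `result` reproduces the pattern of the example: one item of the first input followed by two items of the second input in every cycle of length three.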

It is also interesting to note that the multiplexer operator can be used
to model a de-multiplexer cell. For example, if we want to sample the
sequence $\xi$ at times $t = r, 2r, 3r, \ldots$, then we may express this
operation as $M_r^{1,r-1}(\xi, \delta^*)$ where $\delta^*$ is the don't care
sequence introduced earlier.

The multiplexer operator can be used to define two further operators,
namely, the expansion and the piping operators.

The Expansion operator $E_r^k : \bar{R}_\delta \to \bar{R}_\delta$ models a
cyclic memory that is loaded at time $t=r$ and is overwritten every $k$ time
units. It is formally defined by

$$E_r^k\,\eta = M_r^{1,\ldots,1}(\eta,\ \Omega\,\eta,\ \Omega^2\eta,\ \ldots,\ \Omega^{k-1}\eta),$$

which on the basis of the definition of the multiplexer operator may be
rewritten as

$$[E_r^k\,\eta](t) = \begin{cases} \delta & t < r \\ \eta(t - t_m) & t \ge r \end{cases}$$

where $t_m = \mathrm{mod}((t-r) \div k)$. For example, with $\xi$ of (2.4) we have
E^4  =  •  •  • 

Note that the accumulator, multiplexer and expansion operators are
weakly causal operators. Besides the causal and weakly causal operators
used in modeling computational cells, some sequence operators may be
introduced for the sole purpose of allowing us to simplify the description of
data sequences. Following are two such operators:

The Piping operator $P_m^k : [\bar{R}_\delta]^m \to \bar{R}_\delta$ is
defined by

$$P_m^k(\eta^1, \ldots, \eta^m) = M_1^{k,\ldots,k}(\eta^1,\ \ldots,\ \Omega^{(e-1)k}\eta^e,\ \ldots,\ \Omega^{(m-1)k}\eta^m)$$

and $T(P_m^k(\eta^1,\ldots,\eta^m)) = mk$. In other words, $P_m^k$
concatenates the first $k$ elements of each of the $m$ sequences $\eta^e$,
$e=1,\ldots,m$, and forms one long sequence.

On the basis of the definition of the multiplexer operator it is easily
shown that the following algorithm is equivalent with the above definition of
the piping operator:

    IF ($t > mk$) THEN $[P_m^k(\eta^1,\ldots,\eta^m)](t) = \delta$
    ELSE
    BEGIN
        Find the smallest integer $1 \le e \le m$ such that $t \le ek$
        $[P_m^k(\eta^1,\ldots,\eta^m)](t) = \eta^e(t-(e-1)k)$
    END
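
The piping algorithm amounts to concatenating blocks of length $k$. The following Python sketch illustrates this under assumptions of our own: $\delta$ is represented by `None`, each input sequence is given as a list holding at least its first $k$ elements, and the block index is taken as the first $e$ with $t \le ek$. The function name `pipe` and the parameter `horizon` are hypothetical.

```python
DELTA = None  # assumed stand-in for the delta element

def pipe(k, seqs, horizon):
    """Piping operator P^k_m: concatenate the first k elements of each of the m sequences."""
    m = len(seqs)
    out = []
    for t in range(1, horizon + 1):
        if t > m * k:
            out.append(DELTA)                          # past the end of the piped sequence
        else:
            e = (t + k - 1) // k                       # block index: first e with t <= e*k
            out.append(seqs[e - 1][t - (e - 1) * k - 1])
    return out

piped = pipe(2, [[1, 2, 9], [3, 4, 9], [5, 6, 9]], horizon=8)
```

Only the first two elements of every input are used here, so the trailing 9 in each list never appears in `piped`.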

In the remainder of this dissertation, we will use the abbreviations
$P^k_{e=1,m}(\eta^e)$ for $P_m^k(\eta^1,\ldots,\eta^m)$, and $P_m^k(\eta)$ for
$P_m^k(\eta,\ldots,\eta)$. As will be seen later, the piping operator is very
useful for the verification of pipelined operation of systolic networks.

The Spread Operator $\Theta^s : \bar{R}_\delta \to \bar{R}_\delta$ is
defined by

$$[\Theta^s\,\xi](t) = \begin{cases} \xi\!\left(\dfrac{t+s}{s+1}\right) & t = 1,\ (s+1)+1,\ 2(s+1)+1,\ \ldots \\ \delta & \text{otherwise.} \end{cases}$$

Hence $\Theta^s$ inserts $s$ $\delta$-elements between every two elements of
$\xi$. With the sequence $\xi$ of (2.4) we have, for example

$$\Theta^2\,\xi = a_1, \delta, \delta, b_1, \delta, \delta, a_2, \delta, \delta, b_2, \ldots$$

In Appendix A, we give some properties about combinations of the
different sequence operators. Those properties provide tools for the
manipulation of sequence expressions and hence will be used extensively in
the verification of systolic networks. In the next section, we introduce the
notion of "Network I/O Description", which is analogous to the transfer
function in circuit theory.
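
Returning to the spread operator defined above, its effect on a finite prefix can be sketched in a few lines of Python. The sketch assumes that $\delta$ is represented by `None` and that symbolic items are represented by strings. The function name `spread` is hypothetical.

```python
DELTA = None  # assumed stand-in for the delta element

def spread(xi, s):
    """Spread operator Theta^s: insert s delta-elements between consecutive elements of xi."""
    length = (len(xi) - 1) * (s + 1) + 1      # position of the last element of xi
    out = []
    for t in range(1, length + 1):
        if (t - 1) % (s + 1) == 0:            # t = 1, (s+1)+1, 2(s+1)+1, ...
            out.append(xi[(t + s) // (s + 1) - 1])
        else:
            out.append(DELTA)
    return out

spaced = spread(["a1", "b1", "a2", "b2"], 2)
```

The result places the four items at times 1, 4, 7 and 10, with two unknown elements between neighbors.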

2.5.  The  Network  I/O  Description. 

Our goal in this section is to model the computation of a systolic
network by describing the relation between its outputs and its inputs. In
order to formalize this input/output relation, we start by introducing some
new terminology. We call "network output sequences" those sequences
associated with the IN edges of sink nodes, and "network input sequences"
those associated with the OUT edges of source nodes. Then the system of all
node I/O descriptions provides a specification of the computation performed
by the network in the form of an implicit relation between the network
input and output sequences. This relation will be called the "network I/O
description".

As a simple example, consider the hypothetical network with the graph
shown in Figure 2.6. In this graph, we assume that the edges directed to
the left are given the color $y$ and those directed to the right the color $x$.
We also follow the naming convention of Section 2.2 to identify the different
edges in the graph. To complete the network description, a node I/O
description has to be specified for each node in the graph. Assume that
these are given by the following causal relations:

^q  —  ^  1 


For  node  1: 


(2.5.a)
(2.5.b)

For node 2:    $\xi_3 = \Omega\,\xi_2$    (2.6)

For node 3:    $\eta_2 = \Omega\,[\,\xi_3 * \eta_3\,]$    (2.7)


Figure 2.6 - A hypothetical systolic network

For this network, $\eta_3$ and $\xi_1$ are the network input sequences and
$\eta_0$ is the network output sequence. In order to obtain the network I/O
description explicitly, we have to solve the equations (2.5), (2.6) and (2.7),
that is, we have to obtain an explicit expression for $\eta_0$ in terms of
$\xi_1$ and $\eta_3$.

Generally, it is very difficult, and sometimes impossible, to derive an
explicit solution of the system of node I/O equations. In the next section,
we show that this task may be greatly simplified in the case of certain
networks with a homogeneous structure.

2.6. Homogeneous Systolic Networks.

By condition [A2], any edge $e \in E$ is uniquely identified by its color and
one of its incident nodes. In fact, we have already used this as a
convenient means for identifying edges by their color and terminal node. Let
$M \subset C_E \times V$ be the set of all pairs $(y,v)$, $y \in C_E$,
$v \in V$, for which there is an edge $e \in E$ with $y = col(e)$ and
$v = \varphi_-(e)$. Then the terminal node $u = \varphi_+(e)$ is uniquely
given and hence the successor function $\mu : M \to V$ is well defined by the
association

$$(y,v) \in M,\quad y = col(e),\quad v = \varphi_-(e) \quad\longrightarrow\quad \mu(y,v) = \varphi_+(e).$$

In other words, if there exists an edge $e$ with color $y$ and starting node
$v$, then $\mu(y,v)$ is the terminal node of $e$.

Given a systolic network based on the graph $G = (V, E, \varphi_-, \varphi_+)$,
a subset $\bar{V}_I \subseteq V_I$ of interior nodes is said to be a
homogeneous set if:

[H1] All the nodes in $\bar{V}_I$ have identical IN and OUT degrees, say $m$
and $n$, respectively.

[H2] The sets of the $m$ colors for the IN edges of any interior node
$v \in \bar{V}_I$ are identical. So are the sets of the $n$ colors for the OUT
edges of $v$. Denote the colors of the IN and OUT edges of $v$ by
$y^1, y^2, \ldots, y^m$ and $z^1, \ldots, z^n$, respectively.

[H3] The node I/O descriptions of any interior node $v \in \bar{V}_I$ are
generic in the sense that they may be written in the form:

$$\zeta^i_{\mu(z^i,v)} = \Gamma_i(\eta^1_v, \ldots, \eta^m_v), \qquad i = 1,\ldots,n$$

where $\Gamma_i$, $i = 1,\ldots,n$ are given $m$-ary operators which are
independent of the particular node in $\bar{V}_I$, $\mu$ is the successor
function defined earlier in this section and $\eta^j_v$, $j = 1,\ldots,m$ and
$\zeta^i_{\mu(z^i,v)}$, $i = 1,\ldots,n$ are the sequences associated with the
IN and OUT edges of $v$, respectively.

A network is said to be homogeneous if the set of interior nodes in
its graph $G$ is a homogeneous set. More generally, if there exists a
partition $V_I = V_I^1 \cup \cdots \cup V_I^k$ of $V_I$ into $k$ non-empty
homogeneous subsets $V_I^1, \ldots, V_I^k$, then the network is said to be
$k$-partially homogeneous. The main advantage of having a homogeneous (or
partially homogeneous) network is that the resulting system of equations has
a repetitive pattern, which, in many cases, allows us to obtain its solution
analytically.


As an example, we consider the 1-D convolution network described in
the beginning of this chapter. The graph of this network is shown in Figure
2.7, where we assumed that the edges directed to the left have the color
's', while those directed to the right have the color 'p'. The nodes of the
graph are identified by the integers $-1, 0, 1, \ldots, k+2$, where nodes $-1$
and $k+2$ are source nodes, nodes 0 and $k+1$ sink nodes, and nodes 1 through
$k$ interior nodes. The successor function is defined for any interior node
$i = 1, \ldots, k$ by


(2.11)

This is the I/O description for the network.

In the previous example we derived the network I/O description for a
homogeneous network. The technique is equally applicable to $k$-partially
homogeneous networks if $k$ is reasonably small. In that case, a system of
difference equations is formed by writing the generic I/O description for a
typical node from each of the $k$ homogeneous subsets of interior nodes.
The network I/O description is then obtained by solving this system of
equations. The back substitution network and the sorting networks discussed
in the next chapter are examples of 2-partially homogeneous networks.

Finally, we note that the system of causal equations that models the
computation of the different cells in a systolic network is an implicit I/O
description for that network. The derivation of an explicit formula for the
I/O description depends on our ability to solve this system of equations
analytically. In both its implicit and explicit forms, the I/O description is a
characteristic of the network itself which is independent of any particular
computation performed on the network. In the next chapter, we associate a
certain computation with a given network, and we suggest a technique for
the verification of this computation.

3. FORMAL VERIFICATION OF SYSTOLIC NETWORKS.

A computation on a systolic network is defined by two essential
components, namely, the systolic network and the description of its input.
The network itself is characterized by its I/O description, which provides the
general relation between the inputs and the outputs. However, in the
verification of a particular computation, we are usually interested in the
behavior of the network for specific inputs. That is, we wish to verify that,
for specific inputs, the network will produce an output with prescribed
properties.

Given the I/O description and the input specifications, the verification of
the network's operation is accomplished by substituting the input
specifications into the I/O description and then manipulating the resulting
equations to obtain an explicit description of the output sequences. This
explicit description should be in a form that may be compared with the
specification of the expected output. In order to clarify our technique, we
consider again the example of the 1-D convolution network whose I/O
description was found to be given by equations (2.11). Our goal is to verify
that the network indeed produces the results in equation (2.1) for the network
input sequences described by

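$$\sigma_1 = \Omega^{k-1}\,\Theta\,\iota \tag{3.1.a}$$

$$\pi_k = \Theta\,\xi \tag{3.1.b}$$

where

$$\iota(t) = 0, \qquad 1 \le t \le T(\iota) = n-(k-1)$$

$$\xi(t) = x_t, \qquad 1 \le t \le T(\xi) = n$$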

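In order to find the corresponding specific form of the output
sequence $\sigma_{k+1}$, we substitute (3.1) into (2.11) and obtain

$$\sigma_{k+1} = \Omega^{2k-1}\,\Theta\,\iota + \sum_{j=1}^{k} \Omega^{2j-1}\,[\,w_{k-j+1} \cdot \Theta\,\xi\,].$$

By the properties P3.1, P4, P2.1 and P1.1 in Appendix A, this may be
rewritten as

$$\begin{aligned}
\sigma_{k+1} &= \Omega^{2k-1}\,\Theta\,\iota + \Omega \sum_{j=1}^{k} \Omega^{2(j-1)}\,\Theta\,[\,w_{k-j+1} \cdot \xi\,] \\
&= \Omega^{2k-1}\,\Theta\,\iota + \Omega\,\Theta \sum_{j=1}^{k} \Omega^{j-1}\,[\,w_{k-j+1} \cdot \xi\,] \\
&= \Omega^{2k-1}\,\Theta\,\iota + \Omega\,\Theta \sum_{j=1}^{k} \Omega^{j-1}\,\eta_j
\end{aligned}$$

where $T(\eta_j) = T(\xi) = n$ and $\eta_j(t) = w_{k-j+1}\,\xi(t) = w_{k-j+1}\,x_t$.
Finally, applying Property P19 of Appendix A, we find that:

$$\begin{aligned}
\sigma_{k+1} &= \Omega^{2k-1}\,\Theta\,\iota + \Omega\,\Theta\,\Omega^{k-1}\,\eta \\
&= \Omega^{2k-1}\,\Theta\,[\,\iota + \eta\,] \\
&= \Omega^{2k-1}\,\Theta\,\eta
\end{aligned} \tag{3.2}$$

where $\eta$ is defined by $T(\eta) = n-(k-1)$ and

$$\begin{aligned}
\eta(t) &= \sum_{j=1}^{k} \eta_j(t+k-j) & 1 \le t \le T(\eta) \\
&= \sum_{j=1}^{k} w_{k-j+1}\,x_{t+k-j} & 1 \le t \le T(\eta) \\
&= \sum_{q=1}^{k} w_q\,x_{t+q-1} & 1 \le t \le T(\eta)
\end{aligned}$$

In the last line, the summation index was changed to $q = k-j+1$ in order to
provide for the same expression as in (2.1).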

Evidently, equation (3.2) represents the output of the array in a clear
and precise form: it indicates that after an initial period of $2k-1$ time
units, the elements $\eta(t) = y_t$, $1 \le t \le n-(k-1)$, will appear on the
output link, each separated from the other by one time unit.
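
The timing statement above can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not a simulation of the network: it takes the convolution sums as $y_t = \sum_{q=1}^{k} w_q x_{t+q-1}$, represents $\delta$ by `None`, and builds the expected output by placing $2k-1$ unknown elements in front and one unknown element between consecutive results. The function name `expected_output` is hypothetical.

```python
DELTA = None  # assumed stand-in for the delta element

def expected_output(w, x):
    """Return the convolution results and the timed output sequence of the array."""
    k, n = len(w), len(x)
    # y_t = sum_{q=1..k} w_q * x_{t+q-1} for t = 1 .. n-(k-1)
    eta = [sum(w[q - 1] * x[t + q - 2] for q in range(1, k + 1))
           for t in range(1, n - k + 2)]
    timed = [DELTA] * (2 * k - 1)        # initial period of 2k-1 time units
    for i, value in enumerate(eta):
        if i > 0:
            timed.append(DELTA)          # results are one time unit apart
        timed.append(value)
    return eta, timed

eta, timed = expected_output([1, 2, 3], [1, 0, 2, 1, 3])
```

For $k=3$ and $n=5$ the three results appear at times 6, 8 and 10.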

In the last example, and in the example of the matrix/vector
multiplication network presented in the next section, the verification
procedure is based on the availability of an explicit I/O description which
represents a general solution of the system of equations that models the
computational cells in the network. However, in some cases, it is difficult to
solve this system of equations in general, and it is much easier to solve it
for specific input sequences. In other words, sometimes it may be difficult to
determine the explicit I/O description while it is easier to obtain the
description of the output sequences for specific inputs. The example of the
back substitution network given in Section 3.2 illustrates this further. In this
example, we verify the operation of the network without obtaining an explicit
formula for the network I/O description.

The existence of an analytical solution to systems of causal equations
is discussed in some detail in Section 3.4, where we show that it is
always possible to obtain a formal solution of such a system of equations.
However, the form of the solution may sometimes be very complicated and
hence not practically useful. In such cases, we may still verify the
operation of the corresponding network if we have an idea about the network
behavior and hence about the sequences associated with the different edges
of the graph. In fact, we need to show that for the given input sequences,
the expected sequences satisfy the system of causal equations. We
demonstrate this procedure in Section 3.3 by verifying the operation of a
sorting network for which we could not solve the system of equations
explicitly.

3.1.  A  matrix-vector  multiplication  network. 

In [31], Kung and Leiserson suggested a systolic network for the
computation of the product $y=Ax$ of a matrix $A$ and a vector $x$. Here, we
modify this architecture for the case where $A$ is a square, symmetric matrix.
The goal of this modification is to reduce the number of input items to the
network by only supplying the elements in the upper half of $A$. This has
the effect of reducing the I/O bandwidth of the network, which is usually a
limiting factor in VLSI architectures.

The modified network will be formally specified using the abstract model
of Chapter 2. Moreover, the sequence notation will provide a clear and
accurate representation of the input required for the proper operation of the
network, including information about the relative timing of the inputs on the
different links. We will also apply the technique discussed earlier of
obtaining the network I/O description and then of verifying that, with the
appropriate input, the network does produce the elements of the product
vector $y$.

In Figure 3.1, we show the graph of the multiplication network for
matrices of order $k$. It consists of $2(k+1)$ internal nodes, each labeled by
a pair $(i,g)$, $0 \le i \le 1$, $1 \le g \le k+2$. The set of colors $C_E$ has
three elements, namely, $s$, $r$, and $z$, and the coloring function col() maps
the edges to the colors as shown in the figure.


Figure  3.1  -  The  graph  for  the  matrix/vector  multiplication  network 

The principle of operation of the network may be explained by
decomposing the product vector $y=Ax$ into two vectors $y^u$ and $y^l$ such
that $y = y^u + y^l$. More specifically, if $a_{i,j}$ and $x_i$ denote the
elements of $A$ and $x$, respectively, we write the elements of $y^u$ and
$y^l$ as follows:

$$y^l_i = \begin{cases} 0 & i = 1 \\ \sum_{j=1}^{i-1} a_{i,j}\,x_j = \sum_{j=1}^{i-1} a_{j,i}\,x_j & i = 2,\ldots,k \end{cases} \tag{3.3.a}$$

$$y^u_i = \sum_{j=i}^{k} a_{i,j}\,x_j \qquad i = 1,\ldots,k \tag{3.3.b}$$
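
The decomposition can be checked numerically. The following Python sketch assumes that $y^u$ collects the diagonal and upper-triangular contributions and $y^l$ the strictly lower-triangular ones, the latter rewritten through the symmetry $a_{i,j} = a_{j,i}$ so that only the upper half of $A$ is read. The function name `split_product` and the test matrix are hypothetical.

```python
def split_product(A, x):
    """Return (y_u, y_l) for a symmetric matrix A, reading only entries A[i][j] with j >= i."""
    k = len(x)
    # y_u[i] = sum over j >= i of a_{i,j} x_j
    y_u = [sum(A[i][j] * x[j] for j in range(i, k)) for i in range(k)]
    # y_l[i] = sum over j < i of a_{j,i} x_j, which equals a_{i,j} x_j by symmetry
    y_l = [sum(A[j][i] * x[j] for j in range(i)) for i in range(k)]
    return y_u, y_l

A = [[2, 1, 0],
     [1, 3, 4],
     [0, 4, 5]]
x = [1, 2, 3]
y_u, y_l = split_product(A, x)
y = [u + l for u, l in zip(y_u, y_l)]
```

The sum `y` agrees with the full product $Ax$ even though no entry below the diagonal was accessed.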

The elements of $y^u$ are computed on the sub-network composed of the
cells corresponding to nodes $(0,1), \ldots, (0,k)$ in Figure 3.1. Similarly,
the elements of $y^l$ are computed on the sub-network composed of the cells
corresponding to nodes $(1,2), \ldots, (1,k)$. The cells $(0,k+1)$ and
$(1,k+1)$ are delay cells that align the corresponding elements of $y^u$ and
$y^l$ such that they can be appropriately added in the cell $(0,k+2)$. The
operation of the network is formally specified by providing, for each cell, a
set of equations relating its output sequences to its input sequences. For the
nodes in the homogeneous set $\{(0,g)\ ;\ g = 1,\ldots,k\}$ the generic causal
equations are

$$\sigma_{0,g+1} = \Omega^2\,[\,\sigma_{0,g} + \rho_{0,g} \odot \zeta_{0,g}\,] \qquad g = 1,\ldots,k \tag{3.4.a}$$

$$\rho_{0,g+1} = \Omega\,\rho_{0,g} \qquad g = 1,\ldots,k \tag{3.4.b}$$

$$\zeta_{1,g} = \Omega\,\zeta_{0,g} \qquad g = 1,\ldots,k \tag{3.4.c}$$

where the operator $\odot$ is an element-wise operator obtained by extending
the usual multiplication operator $*$ from $R$ to $R_\delta$ by adding the
following rule:

$$\delta \odot x = x \odot \delta = \delta \ \text{ if } x \ne 0 \qquad\text{and}\qquad \delta \odot 0 = 0 \odot \delta = 0.$$

Note that $\odot$ is not a $\delta$-regular operator in the sense defined in
Section 2.1. However, because the result of the multiplication of any unknown
number with zero is zero, the operator $\odot$ may be naturally implemented by
using a standard multiplication circuit and treating $\delta$ as a don't know
item rather than as a special symbol. With this, each node in the above
homogeneous sub-network represents a multiply/add cell augmented by an
appropriate delay.


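Similarly, the cells corresponding to the nodes in the homogeneous set
$\{(1,g)\ ;\ g = 2,\ldots,k\}$ are specified by

$$\sigma_{1,g+1} = \Omega_0^2\,\sigma_{1,g} \qquad g = 2,\ldots,k \tag{3.5.a}$$

$$\rho_{1,g+1} = \Omega\,[\,\rho_{1,g} + \sigma_{1,g} \odot \zeta_{1,g}\,] \qquad g = 2,\ldots,k \tag{3.5.b}$$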

Finally, the cells $(0,k+1)$, $(1,k+1)$ and $(0,k+2)$ are specified by

$$\sigma_{0,k+2} = \Omega\,\sigma_{0,k+1} \tag{3.6.a}$$

$$\rho_{0,k+2} = \Omega^k\,\rho_{1,k+1} \tag{3.6.b}$$

$$\rho_{0,k+3} = \Omega\,[\,\rho_{0,k+2} + \sigma_{0,k+2}\,] \tag{3.6.c}$$

The system composed of the equations (3.4), (3.5) and (3.6) describes
the operation of the entire network. In order to obtain an explicit form of
the network I/O description, we should solve this system and obtain a direct
relation between the network output sequence of interest, namely
$\rho_{0,k+3}$, and the network input sequences $\rho_{0,1}$, $\sigma_{0,1}$,
$\rho_{1,2}$, $\sigma_{1,2}$ and $\zeta_{0,g}$, $g = 1,\ldots,k$. For this, we
start by solving (3.4.b) and (3.5.a) to obtain

$$\rho_{0,g} = \Omega^{g-1}\,\rho_{0,1} \qquad g = 1,\ldots,k$$

$$\sigma_{1,g} = \Omega_0^{2(g-2)}\,\sigma_{1,2} \qquad g = 2,\ldots,k$$

We then substitute these formulas and equation (3.4.c) into (3.4.a) and
(3.5.b), which gives the following difference equations

$$\sigma_{0,g+1} = \Omega^2\,[\,\sigma_{0,g} + \Omega^{g-1}\rho_{0,1} \odot \zeta_{0,g}\,] \qquad g = 1,\ldots,k \tag{3.7.a}$$

$$\rho_{1,g+1} = \Omega\,[\,\rho_{1,g} + \Omega_0^{2(g-2)}\sigma_{1,2} \odot \Omega\,\zeta_{0,g}\,] \qquad g = 2,\ldots,k \tag{3.7.b}$$

The solutions of (3.7.a) and (3.7.b) may be obtained by applying Lemma
1 in Appendix A. More specifically, with $c=2$, $s=1$ and $r=k+1$ in Lemma 1 we
get

$$\sigma_{0,k+1} = \Omega^{2k}\,\sigma_{0,1} + \sum_{j=1}^{k} \Omega^{2j}\,[\,\Omega^{k-j}\rho_{0,1} \odot \zeta_{0,k-j+1}\,] \tag{3.8.a}$$

while for the values $c=1$, $s=2$ and $r=k+1$ in the same lemma, we obtain

$$\rho_{1,k+1} = \Omega^{k-1}\,\rho_{1,2} + \sum_{j=2}^{k} \Omega^{j-1}\,[\,\Omega_0^{2(k-j)}\sigma_{1,2} \odot \Omega\,\zeta_{0,k-j+2}\,] \tag{3.8.b}$$

Now, from (3.6) follows the network I/O description in the form

$$\rho_{0,k+3} = \Omega^{k+1}\,\rho_{1,k+1} + \Omega^{2}\,\sigma_{0,k+1} \tag{3.9}$$

where $\rho_{1,k+1}$ and $\sigma_{0,k+1}$ are given by (3.8.a/b).

For the proper operation of the network, the input links $s_{0,1}$ and
$r_{1,2}$ are permanently set to zero, that is we always have

$$\sigma_{0,1} = \rho_{1,2} = \iota$$

where $\iota$ is the zero sequence defined in Section 2.1. With this, the
equations (3.8) simplify to

$$\sigma_{0,k+1} = \sum_{j=1}^{k} \Omega^{2j}\,[\,\Omega^{k-j}\rho_{0,1} \odot \zeta_{0,k-j+1}\,] \tag{3.10.a}$$

$$\rho_{1,k+1} = \sum_{j=2}^{k} \Omega^{j-1}\,[\,\Omega_0^{2(k-j)}\sigma_{1,2} \odot \Omega\,\zeta_{0,k-j+2}\,] \tag{3.10.b}$$

In order to perform the matrix/vector multiplication, the elements of the
$(g-1)$st off-diagonal, $1 \le g \le k$, of the matrix $A$ should be supplied
on the network link $z_{0,g}$, followed by an appropriate number of zeroes.
More specifically, the input on these links should be

$$\zeta_{0,g} = \Omega^{2(g-1)}\,\alpha_g \qquad g = 1,\ldots,k \tag{3.11.a}$$

where $T(\alpha_g) = k$ and

$$\alpha_g(t) = \begin{cases} a_{t,t+g-1} & t \le k-g+1 \\ 0 & t > k-g+1 \end{cases}$$

Moreover, the elements of the vector $x$ should be supplied on the links
$r_{0,1}$ and $s_{1,2}$ according to

$$\rho_{0,1} = \xi \tag{3.11.b}$$

$$\sigma_{1,2} = \Omega_0^3\,\xi \tag{3.11.c}$$

where $T(\xi) = k$ and $\xi(t) = x_t$. Note that $s_{1,2}$ carries the same
information as $r_{0,1}$ delayed by three time units. Hence, by adding
appropriate delay cells, we may replace these two input links by only one
link.

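As the next step in the verification technique we have to substitute the
above input sequences into the network I/O description and to demonstrate
that the sequence $\rho_{0,k+3}$ does contain the elements $y_i$,
$i = 1,\ldots,k$, of the product vector $y$. We first substitute (3.11.a/b)
into (3.10.a) which gives

$$\begin{aligned}
\sigma_{0,k+1} &= \sum_{j=1}^{k} \Omega^{2j}\,[\,\Omega^{k-j}\xi \odot \Omega^{2(k-j)}\alpha_{k-j+1}\,] \\
&= \sum_{j=1}^{k} \Omega^{2k}\,[\,\Omega^{-(k-j)}\xi \odot \alpha_{k-j+1}\,] \\
&= \Omega^{2k} \sum_{j=1}^{k} \beta_j
\end{aligned} \tag{3.12}$$

where $\beta_j = \Omega^{-(k-j)}\xi \odot \alpha_{k-j+1}$, that is
$T(\beta_j) = k$ and

$$\beta_j(t) = \begin{cases} \xi(t+k-j)\,a_{t,t+k-j} = x_{t+k-j}\,a_{t,t+k-j} & t \le j \\ 0 & t > j \end{cases}$$

The element-wise addition of the sequences $\beta_j$, $j = 1,\ldots,k$ in
(3.12) then gives

$$\sigma_{0,k+1} = \Omega^{2k}\,\eta^u \tag{3.13}$$

where $T(\eta^u) = k$ and $\eta^u(t) = \sum_{j=1}^{k} \beta_j(t)$. However, for
$j < t$, we have $\beta_j(t) = 0$, from which we get

$$\eta^u(t) = \sum_{j=t}^{k} a_{t,t+k-j}\,x_{t+k-j} = \sum_{q=t}^{k} a_{t,q}\,x_q = y^u_t$$

where in the last line we replaced the summation index $j$ by $q = t+k-j$.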


Analogously, we substitute (3.11.a/c) into (3.10.b) and find, after some
sequence manipulation, that

$$\rho_{1,k+1} = \Omega^{k+1}\,\eta^l \tag{3.14}$$

where $T(\eta^l) = k$ and $\eta^l(t) = y^l_t$.

Finally, from (3.13) and (3.14) in (3.9) we obtain the output sequence

$$\rho_{0,k+3} = \Omega^{2k+2}\,\eta \tag{3.15}$$

where $T(\eta) = k$ and $\eta(t) = y_t$, the $t$th element of the product
vector $y = Ax$. This verifies that the network will indeed produce the
elements of the vector $y$ according to the timing specified by equation
(3.15).
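
The verification above can be cross-checked by simulating the cell equations directly. The Python sketch below is a hedged illustration and rests on several assumptions that should be compared against the original equations: a delay of two time units in (3.4.a), a zero shift of two units in (3.5.a), delays of one and $k$ units in (3.6.a) and (3.6.b), a zero shift of three units in (3.11.c), $\delta$ represented by `None`, and zeroes supplied indefinitely after the last element of each diagonal. All function names are hypothetical.

```python
from functools import lru_cache

DELTA = None  # assumed stand-in for the delta element

def add(a, b):
    """Delta-regular addition."""
    return DELTA if a is DELTA or b is DELTA else a + b

def odot(a, b):
    """Extended multiplication: an unknown operand times zero is zero."""
    if a is DELTA or b is DELTA:
        other = b if a is DELTA else a
        return 0 if other == 0 else DELTA
    return a * b

def simulate(A, x):
    """Return rho_{0,k+3}(t) for t = 1 .. 3k+2, reading only the upper half of A."""
    k = len(x)

    def xi(t):                        # xi(t) = x_t for 1 <= t <= k, delta elsewhere
        return x[t - 1] if 1 <= t <= k else DELTA

    def alpha(g, t):                  # (g-1)st off-diagonal followed by zeroes
        if t < 1:
            return DELTA
        return A[t - 1][t + g - 2] if t <= k - g + 1 else 0

    def zeta0(g, t):                  # (3.11.a): zeta_{0,g} = Omega^{2(g-1)} alpha_g
        return alpha(g, t - 2 * (g - 1))

    def rho0(g, t):                   # (3.4.b) with rho_{0,1} = xi
        return xi(t - (g - 1))

    def sigma1(g, t):                 # (3.5.a) with sigma_{1,2} = Omega_0^3 xi
        u = t - (2 * (g - 2) + 3)
        return 0 if u < 1 else xi(u)

    @lru_cache(maxsize=None)
    def sigma0(g, t):                 # (3.4.a), with sigma_{0,1} the zero sequence
        if g == 1:
            return 0
        u = t - 2
        if u < 1:
            return DELTA
        return add(sigma0(g - 1, u), odot(rho0(g - 1, u), zeta0(g - 1, u)))

    @lru_cache(maxsize=None)
    def rho1(g, t):                   # (3.5.b), with rho_{1,2} the zero sequence
        if g == 2:
            return 0
        u = t - 1
        if u < 1:
            return DELTA
        zeta1 = zeta0(g - 1, u - 1)   # (3.4.c): zeta_{1,g} = Omega zeta_{0,g}
        return add(rho1(g - 1, u), odot(sigma1(g - 1, u), zeta1))

    def out(t):                       # (3.6.a)-(3.6.c)
        u = t - 1
        lower = rho1(k + 1, u - k) if u - k >= 1 else DELTA
        upper = sigma0(k + 1, u - 1) if u - 1 >= 1 else DELTA
        return add(lower, upper)

    return [out(t) for t in range(1, 3 * k + 3)]

A = [[2, 1, 0],
     [1, 3, 4],
     [0, 4, 5]]
x = [1, 2, 3]
trace = simulate(A, x)
```

Under these assumptions the first $2k+2$ entries of `trace` are unknown and the next $k$ entries are the elements of $Ax$, which is the timing expressed by (3.15).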

Remark: In some applications, as in the one described in Section 9.2, the elements of the matrix $A$ cannot be supplied at the high rate specified by equation (3.11.a). More specifically, we assume that the input sequences on


the  links  z_.  g  =  1  .•••./(  have  the  forms 

.  2c  (g-1)  c-1 

C0.g  -  n  6  ag 


g  =  1.*  •  •  .k 


for some integer $c>1$. In this case, the network may still operate correctly if we change the period of the synchronizing clock such that the new period is equal to $c$ times the old one. An alternative solution is to change the delays within the computational cells in order to ensure that the elements of the vector $x$ indeed meet the corresponding elements of the matrix $A$ at the appropriate times. This is accomplished by replacing the specifications (3.4), (3.5) and (3.6) by


E 
3.2. A back substitution network.

In this section, we apply our verification technique to a systolic network that contains two different types of computational cells, namely the back substitution network suggested in [31]. This network performs the back substitution operation to solve the linear system of equations

$$ L u = y \tag{3.16} $$

where $L$ is an $n \times n$ non-singular, banded, lower triangular matrix with the (half-)bandwidth $k+1$, and $y$ is a given $n$-dimensional vector. The solution of the linear system (3.16) is given by the formula


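$$
u_i = \begin{cases}
y_1 / l_{1,1} & i = 1 \\[4pt]
\left( y_i - \sum_{j=1}^{i-1} l_{i,i-j}\, u_{i-j} \right) / l_{i,i} & 2 \le i \le k \\[4pt]
\left( y_i - \sum_{j=1}^{k} l_{i,i-j}\, u_{i-j} \right) / l_{i,i} & k < i \le n
\end{cases}
$$

where $l_{i,j}$ is the $(i,j)$ element of the matrix $L$, and $y_i$ and $u_i$ are the $i$-th elements of the vectors $y$ and $u$, respectively.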
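As a sequential reference against which the network output can be compared, the formula above may be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `solve_banded_lower`, the dense list-of-lists storage of $L$, and the sample data are not part of the report, and indices are 0-based in the code.

```python
def solve_banded_lower(L, y, k):
    """Solve L u = y for a lower triangular L with at most k nonzero subdiagonals."""
    n = len(y)
    u = [0.0] * n
    for i in range(n):                          # 0-based row index
        acc = y[i]
        # Only the k entries left of the diagonal can be nonzero in a banded L.
        for j in range(1, min(i, k) + 1):
            acc -= L[i][i - j] * u[i - j]
        u[i] = acc / L[i][i]
    return u


# Hypothetical 4 x 4 example with one subdiagonal (k = 1).
L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 4.0, 0.0, 0.0],
     [0.0, 3.0, 5.0, 0.0],
     [0.0, 0.0, 2.0, 1.0]]
y = [2.0, 9.0, 21.0, 10.0]
u = solve_banded_lower(L, y, k=1)
print(u)  # [1.0, 2.0, 3.0, 4.0]
```

The inner loop bound `min(i, k)` covers both the case $2 \le i \le k$ and the case $k < i \le n$ of the formula.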


Figure 3.2 shows the graph of the suggested network. It is a 2-partially homogeneous network, composed of $k$ multiply/add (M/A) type cells and one subtract/divide (S/D) cell. The computational cells are labeled by integers such that the cells 1 through $k$ are of the M/A type, and the cell 0 is the S/D cell. As for the I/O cells, we must be careful to assign labels to the sink cells because these labels will be used to identify the network output links. The labels given to source nodes are immaterial as they do not affect the verification procedure, and consequently are not shown in Figure 3.2.

Figure 3.2 - The graph for the back substitution network

In the regular layout shown in Figure 3.2, the edges directed to the south, north, east and west are given the colors a, b, r and s, respectively. The set $V_I$ of interior nodes in $G$ is divided into two homogeneous subsets $V_I^1 = \{0\}$ and $V_I^2 = \{i \,;\, i = 1, \ldots, k\}$. The operation of the cell represented by node '0' is described by the causal relation

$$ \rho_1 = \Omega\, [(\beta_0 - \sigma_0) / \alpha_0] \tag{3.17} $$

and the operation of any M/A cell represented by a node $i$, $1 \le i \le k$, is described by the generic I/O description

$$ \rho_{i+1} = \Omega\, \rho_i \qquad i = 1, \ldots, k \tag{3.18.a} $$

$$ \sigma_{i-1} = \Omega\, [\sigma_i \oplus \alpha_i * \rho_i] \qquad i = 1, \ldots, k \tag{3.18.b} $$

where the operator $\oplus$ was defined by the formula (2.2).




In order to solve the system of difference equations (3.17), (3.18.a/b), we first write the solution of (3.18.a) as

$$ \rho_j = \Omega^{j-1} \rho_1 \qquad 1 \le j \le k+1 \tag{3.19} $$

from which we find that

$$ \rho_{k+1} = \Omega^{k} \rho_1 \tag{3.20} $$

Substitution of (3.19) into (3.18.b) then gives

$$ \sigma_{j-1} = \Omega\, [\sigma_j \oplus \Delta_j] \tag{3.21} $$

where $\Delta_j = \alpha_j * \Omega^{j-1} \rho_1$. Using an inductive argument similar to that in Appendix A for the proof of Lemma 1, we can show that the solution of (3.21) is

$$ \sigma_0 = \Omega^{k} \sigma_k \oplus {\sum_{j=1}^{k}}' \; \Omega^{j} [\alpha_j * \Omega^{j-1} \rho_1] \tag{3.22} $$

where $\sum'$ is defined by ${\sum_{j=1}^{k}}' \, \eta_j = \eta_1 \oplus \eta_2 \oplus \cdots \oplus \eta_k$.

For given $\rho_1$, the network output sequence $\rho_{k+1}$ is easily obtained from (3.20). The next step will be to eliminate $\sigma_0$ from (3.17) and (3.22) and to obtain $\rho_1$ explicitly in terms of the network input sequences $\sigma_k$, $\beta_0$ and $\alpha_j$, $j = 0, \ldots, k$. Unfortunately, if we try to solve (3.17) and (3.22) simultaneously, we will obtain a recursive equation in $\rho_1$, which is very difficult to manipulate in general. For this reason, we consider only specific forms of the network input sequences, namely those required for the proper operation of the network. They are given by

$$ \alpha_j = \Omega^{k+j}\, \Theta\, \lambda_j \qquad j = 0, \ldots, k \tag{3.23.a} $$

$$ \beta_0 = \Omega^{k}\, \Theta\, \eta \tag{3.23.b} $$

$$ \sigma_k = \Theta\, \iota \tag{3.23.c} $$

with $T(\lambda_j) = n-j$, $T(\iota) = T(\eta) = n$ and

$$
\begin{aligned}
\lambda_j(t) &= l_{t+j,\,t} & 1 \le t \le n-j \\
\eta(t) &= y_t & 1 \le t \le n \\
\iota(t) &= 0 & 1 \le t \le n
\end{aligned}
$$

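Substituting (3.23) into (3.17) and (3.22), we find that

$$ \rho_1 = \Omega\, [(\Omega^{k} \Theta \eta - \sigma_0) / \Omega^{k} \Theta \lambda_0] \tag{3.24.a} $$

$$ \sigma_0 = \Omega^{k} \Theta \iota \oplus {\sum_{j=1}^{k}}' \, [\Omega^{2j+k} \Theta \lambda_j * \Omega^{2j-1} \rho_1] \tag{3.24.b} $$

Since $\delta \cdot x = \delta$ for any $x \in R_\delta$, (3.24.a) implies the existence of a sequence $\xi$ such that

$$ \rho_1 = \Omega^{k+1}\, \Theta\, \xi \tag{3.25} $$

whence, by (3.24.b), we find that

$$ \sigma_0 = \Omega^{k}\, \Theta\, \Big[ \iota \oplus {\sum_{j=1}^{k}}' \, \Omega^{j} [\lambda_j * \xi] \Big] $$

where we used property P4 to interchange $\Omega^{2j}$ and $\Theta$. If in addition we let

$$ \gamma = \iota \oplus {\sum_{j=1}^{k}}' \, \Omega^{j} [\lambda_j * \xi] \tag{3.26} $$

then we can substitute for $\sigma_0$ and $\rho_1$ in (3.24.a) and obtain

$$ \Omega^{k+1} \Theta \xi = \Omega\, [(\Omega^{k} \Theta \eta - \Omega^{k} \Theta \gamma) / \Omega^{k} \Theta \lambda_0] $$

which reduces to

$$ \xi = [\eta - \gamma] / \lambda_0 \tag{3.27} $$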

For an explicit description of the sequence $\gamma$, we need to examine (3.26) more closely. We start by evaluating the product term, namely

$$ \Omega^{j} [\lambda_j * \xi] = \Omega^{j} \mu_j $$

where

$$ T(\mu_j) = \min(T(\lambda_j),\, T(\xi)) \le n-j \tag{3.28.a} $$

$$ \mu_j(t) = \lambda_j(t) * \xi(t) \tag{3.28.b} $$

This enables us to rewrite (3.26) as

$$ \gamma = \iota \oplus {\sum_{j=1}^{k}}' \, \Omega^{j} \mu_j \tag{3.29} $$

From (3.29) and the definition of the '$\oplus$' operator, we conclude that $T(\gamma) = \max\{T(\iota),\, T(\mu_j)+j\} = n$, and hence from (3.27) that $T(\xi) = \min\{T(\eta),\, T(\gamma),\, T(\lambda_0)\} = n$. With this in (3.28.a), it is easily seen that $T(\mu_j) = n-j$. Now, we apply property P20 to (3.29) and explicitly describe $\gamma$ by $T(\gamma) = T(\iota) = n$ and

$$
\gamma(t) = \begin{cases}
0 & t = 1 \\[4pt]
\sum_{j=1}^{t-1} \mu_j(t-j) & t = 2, 3, \ldots, k \\[4pt]
\sum_{j=1}^{k} \mu_j(t-j) & t = k+1, k+2, \ldots, n
\end{cases}
$$


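Then finally, with these specific descriptions of $\eta$, $\lambda_0$ and $\gamma$, the explicit form of the sequence $\xi$ in (3.27) is found to be

$$ \xi(t) = (\eta(t) - \gamma(t)) / \lambda_0(t), $$

that is

$$
\xi(t) = \begin{cases}
y_1 / l_{1,1} & t = 1 \\[4pt]
\left( y_t - \sum_{j=1}^{t-1} l_{t,t-j}\, \xi(t-j) \right) / l_{t,t} & 2 \le t \le k \\[4pt]
\left( y_t - \sum_{j=1}^{k} l_{t,t-j}\, \xi(t-j) \right) / l_{t,t} & k+1 \le t \le n
\end{cases}
$$

A comparison of this expression with the formula given in the beginning of the section for the solution of (3.16) shows readily that

$$ \rho_{k+1} = \Omega^{2k+1}\, \Theta\, \xi $$

where $T(\xi) = n$ and $\xi(t) = u_t$.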
3.3. A sorting network.

The sorting network [32] described here accepts an indexed set $X = (x_1, \ldots, x_k)$ of $k$ different real numbers, $x_i \in R$, $i \in K = \{1, \ldots, k\}$, and produces as output the same numbers sorted in ascending order. Figure 3.3 shows the general graph of the network and the labels given to each node. In the figure, the edges directed to the right and left are colored p and s, respectively.

Figure 3.3 - The graph for the sorting network

For any $j \in K$, let $y_1 \le y_2 \le \cdots \le y_j$ be the result of sorting the $j$ elements $x_1, \ldots, x_j$ of $X$ in ascending order. Then for all $(i,j)$ of $D = \{(i,j) \in K \times K \,;\, 1 \le i \le j \le k\}$, the ranking function $f_X : D \to X$ is defined by $f_X(i,j) = y_i$.

With this, we will prove that if the network input sequence is given by

$$ \pi_k = \Theta\, \xi \tag{3.30} $$

where $T(\xi) = k$ and $\xi(t) = x_t$, then the network output sequence has the form

$$ \sigma_{k+1} = \Omega^{2k-1}\, \Theta\, \eta \tag{3.31} $$

where $T(\eta) = k$ and $\eta(t) = f_X(t,k)$.
and 
The network considered in Figure 3.3 is a 2-partially homogeneous network. The cell labeled '1' is a simple latch cell whose operation is described by

$$ \sigma_2 = \Omega\, \pi_1 \tag{3.32.a} $$

while the I/O description of the cells $i = 2, \ldots, k$ is given by

$$ \pi_{i-1} = \Omega\, \max\nolimits_\delta(\pi_i, \sigma_i) \tag{3.32.b} $$

$$ \sigma_{i+1} = \Omega\, \min\nolimits_\delta(\pi_i, \sigma_i) \tag{3.32.c} $$

where $\max_\delta$ and $\min_\delta$ were defined in Section 2.1. In other words, the cells $i = 2, \ldots, k$ are comparison cells which operate as follows: At any time $t$, if neither one of the two inputs $\sigma_i(t)$ or $\pi_i(t)$ is a don't care element $\delta$, then the cell compares the two inputs, and produces as output at time $t+1$ the largest and the smallest of the two numbers on the links $p_{i-1}$ and $s_{i+1}$, respectively. However, if any of the inputs is $\delta$, then the cell acts as a simple latch cell, that is, if $\sigma_i(t) = \delta$ or $\pi_i(t) = \delta$ then

$$ \pi_{i-1}(t+1) = \pi_i(t) \quad \text{and} \quad \sigma_{i+1}(t+1) = \sigma_i(t). $$
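The behaviour of a single comparison cell, as described above, can be sketched in a few lines of Python. The representation of the don't care element by `None` and the function name `comparison_cell` are assumptions made for this illustration only; they do not appear in the report.

```python
DELTA = None  # stands for the don't care element (an assumption of this sketch)


def comparison_cell(pi_in, sigma_in):
    """Return (pi_out, sigma_out), the values emitted one time unit later."""
    if pi_in is DELTA or sigma_in is DELTA:
        # Latch behaviour: each input is passed through unchanged.
        return pi_in, sigma_in
    # Both inputs carry data: the larger value goes out on the p link,
    # the smaller value on the s link.
    return max(pi_in, sigma_in), min(pi_in, sigma_in)


print(comparison_cell(3, 5))      # (5, 3)
print(comparison_cell(DELTA, 5))  # (None, 5)
print(comparison_cell(3, DELTA))  # (3, None)
```

Note that the two outputs are not symmetric in the latch case: the first component always follows the $\pi$ input and the second always follows the $\sigma$ input.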

To obtain the network I/O description, the system of equations (3.32.a/b/c) should be solved for $\sigma_{k+1}$. However, the recursive nature of (3.32.b) and (3.32.c) makes this very difficult, if not impossible. One possible alternative is to suggest a tentative value for the sequences $\pi_i$ and $\sigma_i$, and then to verify that these suggested solutions indeed satisfy (3.32). Of course, any assumed value for $\pi_i$ should reduce to the input sequence (3.30) for $i = k$.

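Let us assume that $\pi_i$ and $\sigma_i$ are given by

$$ \pi_i = \Omega^{k-i}\, \Theta\, \alpha_i \qquad 1 \le i \le k \tag{3.33.a} $$

$$ \sigma_i = \Omega^{k+i-2}\, \Theta\, \beta_i \qquad 2 \le i \le k+1 \tag{3.33.b} $$

where $T(\alpha_i) = T(\beta_i) = k$,

$$
\alpha_i(t) = \begin{cases}
x_t & 1 \le t \le i \\
\max(x_t,\, f_X(t-i,\, t-1)) & i < t \le k
\end{cases}
$$

and

$$
\beta_i(t) = \begin{cases}
f_X(t,\, t+i-2) & 1 \le t \le k+1-i \\
f_X(t,\, k) & k+1-i < t \le k
\end{cases}
$$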

It is very easy to verify that (3.33.a) reduces to (3.30) for $i = k$. Hence, our next step will be to check that (3.33) does satisfy (3.32). For $i = 1$, (3.33.a) reduces to

$$ \pi_1 = \Omega^{k-1}\, \Theta\, \alpha_1 $$

where $T(\alpha_1) = k$, and

$$
\alpha_1(t) = \begin{cases}
x_1 & t = 1 \\
\max(x_t,\, f_X(t-1,\, t-1)) & 1 < t \le k
\end{cases}
$$

Since $f_X(j,j)$ is the maximum element in $\{x_1, x_2, \ldots, x_j\}$, it follows that $x_1 = f_X(1,1)$ and $\max(x_t,\, f_X(t-1,t-1)) = f_X(t,t)$. Hence, we may write

$$ \alpha_1(t) = f_X(t,t) \qquad 1 \le t \le k $$

But from (3.33.b), we obtain for $i = 2$

$$ \sigma_2 = \Omega^{k}\, \Theta\, \beta_2 $$

where $T(\beta_2) = k$ and $\beta_2(t) = f_X(t,t)$, which proves that $\beta_2 = \alpha_1$, and hence $\sigma_2 = \Omega\, \pi_1$.

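The next step is to show that (3.33) does satisfy (3.32.b). For this, we substitute (3.33) into the right hand side of (3.32.b) and denote the resulting sequence by $\rho$. This gives

$$ \rho = \Omega\, \max\nolimits_\delta(\Omega^{k-i} \Theta \alpha_i,\; \Omega^{k+i-2} \Theta \beta_i) \qquad 2 \le i \le k $$

Using property P4 to interchange $\Omega^{2(i-1)}$ and $\Theta$ in the second operand of $\max_\delta$, we obtain

$$ \rho = \Omega^{k-(i-1)}\, \Theta\, \gamma_i \tag{3.34} $$

where $\gamma_i = \max_\delta(\alpha_i,\, \Omega^{i-1} \beta_i)$. By definition of $\max_\delta$, it follows that $T(\gamma_i) = T(\alpha_i) = k$, and

$$
\gamma_i(t) = \begin{cases}
\alpha_i(t) & 1 \le t \le i-1 \\
\max(\alpha_i(t),\, \beta_i(t-i+1)) & i-1 < t \le k
\end{cases}
$$

Hence with the definitions of $\alpha_i(t)$ and $\beta_i(t)$ we obtain

$$
\gamma_i(t) = \begin{cases}
x_t & 1 \le t \le i-1 \\
\max(x_t,\, f_X(t-i+1,\, t-1)) & t = i \\
\max(\max(x_t,\, f_X(t-i,\, t-1)),\, f_X(t-i+1,\, t-1)) & i < t \le k
\end{cases}
$$

Because of $\max(\max(a,b),\, c) = \max(a,b,c)$, and $f_X(t-i,\, t-1) \le f_X(t-i+1,\, t-1)$, we may rewrite $\gamma_i$ as

$$
\gamma_i(t) = \begin{cases}
x_t & 1 \le t \le i-1 \\
\max(x_t,\, f_X(t-(i-1),\, t-1)) & i-1 < t \le k
\end{cases}
$$

from which we find that $\gamma_i(t) = \alpha_{i-1}(t)$, and hence, because of (3.34) and (3.33.a), that $\rho = \pi_{i-1}$. This proves that (3.32.b) is satisfied for the values of $\sigma_i$ and $\pi_i$ given by (3.33).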

Finally, to check that (3.33) does satisfy (3.32.c), we substitute (3.33) into (3.32.c) and denote the resulting sequence by $\tau$. This gives

$$
\begin{aligned}
\tau &= \Omega\, \min\nolimits_\delta(\Omega^{k-i} \Theta \alpha_i,\; \Omega^{k+i-2} \Theta \beta_i) \\
&= \Omega^{k-i+1}\, \Theta\, \min\nolimits_\delta(\alpha_i,\, \Omega^{i-1} \beta_i) \qquad 2 \le i \le k
\end{aligned}
$$

In view of

$$ \min\nolimits_\delta(\alpha_i,\, \Omega^{i-1} \beta_i) = \Omega^{i-1} \varphi_i $$

where $T(\varphi_i) = T(\beta_i) = k$ and

$$
\varphi_i(t) = \begin{cases}
\min(\alpha_i(t+i-1),\, \beta_i(t)) & 1 \le t \le k-(i-1) \\
\beta_i(t) & k-(i-1) < t \le k
\end{cases}
$$

we write

$$ \tau = \Omega^{k+(i+1)-2}\, \Theta\, \varphi_i \tag{3.35} $$

From (3.35) and (3.32.c), it follows that $\tau = \sigma_{i+1}$ only if $\varphi_i = \beta_{i+1}$. To prove this, we substitute the definitions of $\alpha_i(t+i-1)$ and $\beta_i(t)$ into $\varphi_i(t)$ and obtain

$$
\varphi_i(t) = \begin{cases}
\min(x_i,\, f_X(1,\, i-1)) & t = 1 \\
\min(\max(x_{t+i-1},\, f_X(t-1,\, t+i-2)),\, f_X(t,\, t+i-2)) & 1 < t \le k-(i-1) \\
f_X(t,\, k) & k-(i-1) < t \le k
\end{cases}
$$

But from Lemma 5 in Appendix A, and the fact that $f_X(t,\, t+i-1) = f_X(t,k)$ for $t = k-i+1$, we may write $\varphi_i(t)$ as

$$
\varphi_i(t) = \begin{cases}
f_X(t,\, t+i-1) & 1 \le t \le k-i \\
f_X(t,\, k) & k-i < t \le k
\end{cases}
$$

It follows that $\varphi_i(t) = \beta_{i+1}(t)$ and therefore that $\tau = \sigma_{i+1}$. This completes the proof that the sequences $\pi_i$ and $\sigma_i$ of (3.33) indeed satisfy the system of equations (3.32).

Now that (3.33.b) is known to be a valid formula for the sequence $\sigma_i$, we can easily obtain the network output sequence $\sigma_{k+1}$ by setting $i = k+1$. This gives

$$ \sigma_{k+1} = \Omega^{2k-1}\, \Theta\, \beta_{k+1} $$

where $T(\beta_{k+1}) = k$ and $\beta_{k+1}(t) = f_X(t,k)$, $1 \le t \le k$, which is identical with the expected output sequence (3.31).
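The result can also be checked numerically by stepping the cell equations (3.32) forward in time. The following Python sketch does so under several assumptions that are not taken from the report: the don't care element is represented by `None`, $\Omega$ is read as a delay by one time unit, $\Theta$ is read as inserting one don't care element after each data item (so that item $t$ of the input appears at time $2t-1$), and the names `simulate_sorting_network` and `horizon` are hypothetical.

```python
DELTA = None  # stands for the don't care element (an assumption of this sketch)


def simulate_sorting_network(xs):
    """Step equations (3.32) in time and return the sequence on link s_{k+1}.

    The returned list is indexed by time, starting at 1 (index 0 is unused).
    """
    k = len(xs)
    horizon = 4 * k  # enough time steps for every output to appear
    # pi[i][t] and sigma[i][t] hold the element of pi_i / sigma_i at time t.
    pi = [[DELTA] * (horizon + 2) for _ in range(k + 2)]
    sigma = [[DELTA] * (horizon + 2) for _ in range(k + 2)]
    for t, x in enumerate(xs, start=1):
        pi[k][2 * t - 1] = x                  # network input, as in (3.30)
    for t in range(1, horizon + 1):
        sigma[2][t + 1] = pi[1][t]            # latch cell 1, equation (3.32.a)
        for i in range(2, k + 1):             # comparison cells 2..k
            p, s = pi[i][t], sigma[i][t]
            if p is DELTA or s is DELTA:
                larger, smaller = p, s        # latch behaviour
            else:
                larger, smaller = max(p, s), min(p, s)
            pi[i - 1][t + 1] = larger         # equation (3.32.b)
            sigma[i + 1][t + 1] = smaller     # equation (3.32.c)
    return sigma[k + 1]


out = simulate_sorting_network([3, 1, 2])
print([out[t] for t in (6, 8, 10)])  # [1, 2, 3]
```

For $k = 3$ the sorted items appear at times $2k + 2t - 2 = 6, 8, 10$, which is the timing expressed by $\Omega^{2k-1}\,\Theta\,\eta$ in (3.31) under the reading of $\Omega$ and $\Theta$ assumed here.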

After illustrating our verification technique by various examples, we investigate in the next section the solvability of systems of causal equations. Clearly, this is a crucial issue determining the general applicability of the technique.


3.4.  Analytical  Solution  of  Systems  of  Causal  Equations. 

In this section we discuss the existence of analytical solutions of systems of causal sequence equations. Here we use the term sequence equation in a restrictive manner to indicate an equation in which the left side is a sequence and the right side is a sequence expression. This is the only type of equation needed for modeling the operation of systolic networks.

We begin by defining the dependency matrix which describes the structure of a given system of sequence equations. Then an iteration operator $I$ is introduced and, with the help of this operator, it is shown that the solution of any system of causal equations can be expressed in analytical form. Finally, we present some examples that demonstrate the applicability of the iteration operator for the analytical verification of systolic networks.

3.4.1.  Definition  and  Existence  of  Analytical  Solutions. 

In order to discuss systems of equations on sequences without referring to the networks that are modeled by these equations, let $Q$ denote the set of all sequences that appear in a given system of sequence equations. This set $Q$ is partitioned into three mutually disjoint sets, namely, the set of input sequences $Q_p$, the set of output sequences $Q_o$ and the set of intermediate sequences $Q_r$. Here, an input sequence is a sequence that does not appear on the left side of any equation in the system, an output sequence is a sequence that does not appear on the right side of any equation in the system, and an intermediate sequence is a sequence in $Q$ which is neither in $Q_p$ nor in $Q_o$.

Accordingly, a solution of the given system of sequence equations is defined as a set of formulas, involving only well defined sequence operators, that explicitly describe the sequences in $Q_o$ in terms of those in $Q_p$. Here, a well defined operator is understood to mean any operator whose image can be obtained from its operands using a deterministic expression. The operators defined in Chapter 2 are examples of well defined operators.


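Let $q_p$, $q_o$ and $q_r$ be the cardinalities of the sets $Q_p$, $Q_o$ and $Q_r$, respectively. We enumerate the sequences in $Q$ by integers $i = 1, \ldots, q_o+q_r+q_p$ such that for any sequence $\xi_i \in Q$,

$$
\xi_i \in \begin{cases}
Q_o & \text{if } i \le q_o \\
Q_r & \text{if } q_o < i \le q_o+q_r \\
Q_p & \text{if } q_o+q_r < i \le q_o+q_r+q_p
\end{cases}
$$

The structure of the system of equations can then be described in terms of the dependency matrix $A$, which is a binary, square matrix of order $q_o+q_r+q_p$ with the elements

$$
a_{i,j} = \begin{cases}
1 & \text{if } \xi_j \text{ appears on the right side of the equation describing } \xi_i \\
0 & \text{otherwise.}
\end{cases}
$$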

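For example, consider the following two systems of sequence equations.

System $S$

$$
\begin{aligned}
\xi_1 &= \Gamma_1(\xi_3,\, \xi_6) \\
\xi_2 &= \Gamma_2(\xi_4,\, \xi_5) \\
\xi_3 &= \Gamma_3(\xi_4,\, \xi_6) \\
\xi_4 &= \Gamma_4(\xi_5,\, \xi_7) \\
\xi_5 &= \Gamma_5(\xi_6,\, \xi_7)
\end{aligned}
$$

System $\bar{S}$

This is the same as system $S$ except that the last equation is replaced by

$$ \xi_5 = \bar{\Gamma}_5(\xi_3,\, \xi_6,\, \xi_7) $$

Here $\Gamma_i$, $i = 1, \ldots, 5$, and $\bar{\Gamma}_5$ are sequence operators.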

In both systems, we have $Q_p = \{\xi_6, \xi_7\}$, $Q_o = \{\xi_1, \xi_2\}$ and $Q_r = \{\xi_3, \xi_4, \xi_5\}$, and the dependency matrices are

$$
A = \begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\qquad
\bar{A} = \begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\tag{3.36}
$$

for $S$ and $\bar{S}$, respectively.
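The construction of a dependency matrix from the equations is mechanical, as the following Python sketch suggests. The dictionary encoding of a system (left-side index mapped to the list of right-side indices) and the names `system_S`, `system_S_bar` and `dependency_matrix` are assumptions of this illustration.

```python
# Each entry maps the number of the sequence on the left side of an equation
# to the numbers of the sequences appearing on its right side (1-based).
system_S = {1: [3, 6], 2: [4, 5], 3: [4, 6], 4: [5, 7], 5: [6, 7]}
# Second system: only the equation for sequence 5 differs.
system_S_bar = {**system_S, 5: [3, 6, 7]}


def dependency_matrix(equations, order):
    """Return the binary matrix with a[i][j] = 1 iff xi_j occurs in the equation for xi_i."""
    A = [[0] * order for _ in range(order)]
    for i, arguments in equations.items():
        for j in arguments:
            A[i - 1][j - 1] = 1
    return A


for row in dependency_matrix(system_S, 7):
    print(row)
```

Rows 6 and 7 stay zero because the input sequences never appear on a left side, and columns 1 and 2 stay zero because the output sequences never appear on a right side.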

From the definition of the input, output and intermediate sequences, it is clear that any dependency matrix $A$ can be partitioned into the following form:

$$
A = \begin{bmatrix}
0 & A_{o,r} & A_{o,p} \\
0 & A_{r,r} & A_{r,p} \\
0 & 0 & 0
\end{bmatrix}
$$

where the dimensions of the sub-matrices $A_{o,r}$, $A_{o,p}$, $A_{r,r}$ and $A_{r,p}$ are $q_o \times q_r$, $q_o \times q_p$, $q_r \times q_r$ and $q_r \times q_p$, respectively, and each 0 denotes a zero sub-matrix of the appropriate dimension. This decomposed form of the matrix $A$ shows that our ability to express explicitly the sequences in $Q_o$ in terms of those in $Q_p$ depends only on the structure of the submatrix $A_{r,r}$. In other words, if $A_{r,r}$ is a strictly lower or strictly upper triangular matrix, then by back substitution, we can easily express the sequences in $Q_r$ in terms of those in $Q_p$. This in turn enables us to relate explicitly the sequences in $Q_o$ to those in $Q_p$ for any form of the submatrices $A_{o,r}$ and $A_{o,p}$. For example, the matrix $A_{r,r}$ corresponding to the matrix $A$ in (3.36) is strictly upper triangular. Hence, for the system of equations $S$, we obtain by back substitution

$$
\begin{aligned}
\xi_4 &= \Gamma_4(\Gamma_5(\xi_6, \xi_7),\, \xi_7) = \Delta_4(\xi_6, \xi_7) \\
\xi_3 &= \Gamma_3(\Delta_4(\xi_6, \xi_7),\, \xi_6) = \Delta_3(\xi_6, \xi_7)
\end{aligned}
$$

which leads to

$$
\begin{aligned}
\xi_1 &= \Gamma_1(\Delta_3(\xi_6, \xi_7),\, \xi_6) = \Delta_1(\xi_6, \xi_7) \\
\xi_2 &= \Gamma_2(\Delta_4(\xi_6, \xi_7),\, \Gamma_5(\xi_6, \xi_7)) = \Delta_2(\xi_6, \xi_7)
\end{aligned}
$$
where the operators $\Delta_i$, $i = 1, \ldots, 4$, are defined in terms of the known sequence operators $\Gamma_i$, and hence are themselves well defined sequence operators.

It should be noted that the structure of the matrix $A_{r,r}$ depends primarily on the numbering of the sequences in $Q_r$. More specifically, whenever there exists a numbering that results in $A_{r,r}$ being strictly upper or strictly lower triangular, then, as stated, it is possible to solve the corresponding system by back substitution. This situation applies only if the system of causal equations does not contain any direct or indirect recursion.
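Whether such a numbering exists can be decided without trying all permutations: a numbering that makes the intermediate block strictly triangular exists exactly when the dependencies among the intermediate sequences contain no cycle. The Python sketch below checks this with a depth-first search. The encoding of the systems and the names `has_recursion`, `system_S` and `system_S_bar` are assumptions of this illustration, not notation from the report.

```python
def has_recursion(equations, intermediates):
    """Return True if the intermediate sequences depend on each other cyclically."""
    # Keep only dependencies between intermediate sequences.
    deps = {i: [j for j in equations.get(i, []) if j in intermediates]
            for i in intermediates}
    state = {}  # node -> "active" (on the current search path) or "done"

    def visit(i):
        if state.get(i) == "active":
            return True               # returned to a node on the current path
        if state.get(i) == "done":
            return False
        state[i] = "active"
        found = any(visit(j) for j in deps[i])
        state[i] = "done"
        return found

    return any(visit(i) for i in intermediates)


system_S = {1: [3, 6], 2: [4, 5], 3: [4, 6], 4: [5, 7], 5: [6, 7]}
system_S_bar = {**system_S, 5: [3, 6, 7]}

print(has_recursion(system_S, {3, 4, 5}))      # False
print(has_recursion(system_S_bar, {3, 4, 5}))  # True
```

In the second system the chain 3, 4, 5, 3 closes a cycle, so no renumbering of the intermediate sequences can make the corresponding block strictly triangular.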

On the other hand, if the system of equations does contain recursion, then for any numbering of the sequences in $Q_r$, the matrix $A_{r,r}$ cannot be strictly upper or lower triangular, and hence, the simple back substitution scheme cannot be carried to completion. For example, in the system of equations $\bar{S}$, we cannot express the sequences $\xi_3$ and $\xi_4$ in terms of $\xi_6$ and $\xi_7$ unless we have a method of solving the coupled equations

$$ \xi_4 = \Delta_4(\xi_3,\, \xi_6,\, \xi_7) \tag{3.37.a} $$

$$ \xi_3 = \Delta_3(\xi_4,\, \xi_6) \tag{3.37.b} $$

where $\Delta_3$ and $\Delta_4$ are well defined operators.

Yet, in the special case when the operators $\Delta_3$ and $\Delta_4$ are causal operators, it is possible to calculate the sequences $\xi_3$ and $\xi_4$ for any given specific sequences $\xi_6$ and $\xi_7$. In other words, the equations (3.37.a/b) always have a solution. This is an indication that our inability to express this solution analytically is due to the fact that our notation suppresses the time dimension from the sequence equations. This motivates the introduction of the iteration operator.

3.4.2.  The  Iteration  operator. 


It can be easily shown that the solution of any coupled system of equations may be obtained if we have a means of solving recursive equations of the form

$$ \zeta = \Gamma(\zeta,\, \xi_1, \ldots, \xi_n) \tag{3.38} $$

where $\Gamma$ is some sequence operator. For example, the solution of the coupled system (3.37) may be obtained if we can solve the recursive equation resulting from the substitution of (3.37.b) into (3.37.a), namely

$$ \xi_4 = \Delta_4(\Delta_3(\xi_4,\, \xi_6),\, \xi_6,\, \xi_7) $$

In general, the solution of (3.38) may not be well defined. However, systems of sequence equations resulting from modeling systolic networks have the nice property that they contain only causal operators. Hence, we will consider (3.38) only for causal operators $\Gamma$.

Theorem 3.1: Given a causal operator $\Gamma : [\bar{R}_\delta]^{n+1} \to \bar{R}_\delta$, the solution $\zeta$ of

$$ \zeta = \Gamma(\zeta,\, \xi_1, \ldots, \xi_n) \tag{3.39} $$

is well defined.

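Proof: We prove this theorem by means of the following procedure for the computation of $\zeta$.

ALG1

1) Let $\alpha_0$ be the sequence whose elements are all $\delta$.

2) FOR $k = 1, 2, \ldots$ DO

2.1) Compute the sequence $\alpha_k$ as follows

$$
\alpha_k(t) = \begin{cases}
\alpha_{k-1}(t) & t < k \\
[\Gamma(\alpha_{k-1},\, \xi_1, \ldots, \xi_n)](k) & t = k \\
\delta & t > k
\end{cases}
$$

2.2) Set $\zeta(k) = \alpha_k(k)$.

In order to prove that the sequence $\zeta$ computed by ALG1 satisfies (3.39), we define the step operators $S_k$ for $k = 0, 1, 2, \ldots$ by

$$ S_0\, \zeta = \alpha_0 $$

and, for $k > 0$, by

$$
[S_k\, \zeta](t) = \begin{cases}
\zeta(t) & \text{if } 1 \le t \le k \\
\delta & \text{if } t > k
\end{cases}
$$

With this, it is directly seen that, for any $t$, $\alpha_k(t) = [S_k \zeta](t)$ and hence $\alpha_k = S_k \zeta$. From ALG1, we then have

$$ \zeta(t) = \alpha_t(t) = [\Gamma(S_{t-1} \zeta,\, \xi_1, \ldots, \xi_n)](t) \tag{3.40} $$

However, the definition of causality implies that $\zeta(t)$ may depend only on elements $\zeta(l)$ with $l < t$; that is, we may replace $S_{t-1} \zeta$ in (3.40) by $\zeta$. This gives

$$ \zeta(t) = [\Gamma(\zeta,\, \xi_1, \ldots, \xi_n)](t) $$

and proves that the sequence $\zeta$ computed by ALG1 indeed satisfies the equation (3.39). ■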
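The procedure ALG1 translates almost directly into code. The Python sketch below is an illustration under stated assumptions: sequences are truncated to a finite list of length `horizon` (index 0 holds the element at time 1), the don't care element is represented by `None`, and the operator `gamma` is a hypothetical causal operator chosen for the example (a one-step delay of a running sum). None of these names come from the report.

```python
DELTA = None  # stands for the don't care element (an assumption of this sketch)


def solve_recursive(gamma, inputs, horizon):
    """Compute the first `horizon` elements of the solution of zeta = gamma(zeta, *inputs)."""
    alpha = [DELTA] * horizon                     # alpha_0: all don't care
    for k in range(horizon):                      # list index k is time k + 1
        image = gamma(alpha, *inputs)             # Gamma applied to alpha_{k-1}
        # Keep the elements fixed so far, adopt the new one, pad with don't cares.
        alpha = alpha[:k] + [image[k]] + [DELTA] * (horizon - k - 1)
    return alpha


def gamma(zeta, xi):
    """Hypothetical causal operator: output at time t uses only inputs at time t - 1."""
    out = [DELTA]                                 # nothing is known at time 1
    for t in range(1, len(xi)):
        previous = 0 if zeta[t - 1] is DELTA else zeta[t - 1]
        out.append(xi[t - 1] + previous)
    return out


xi = [1, 2, 3, 4]
zeta = solve_recursive(gamma, [xi], horizon=len(xi))
print(zeta)                      # [None, 1, 3, 6]
print(gamma(zeta, xi) == zeta)   # True
```

The final comparison confirms that the computed sequence is a fixed point of the operator over the simulated horizon, which is the property established for ALG1 in the proof above.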

Theorem 3.1 proves the existence of a solution of recursive causal equations and gives a procedure for its computation. Our next goal is to provide a suitable notation for expressing this solution.

Definition: Let $\Gamma : [\bar{R}_\delta]^{n+1} \to \bar{R}_\delta$ be a given causal operator. The iteration operator $I$ applied to the image sequence $\Gamma(\eta_1, \ldots, \eta_{n+1})$ with respect to any of the arguments $\eta_r$, $1 \le r \le n+1$, shall be defined by

$$ \zeta = I_{\eta_r}\, \Gamma(\eta_1, \ldots, \eta_{n+1}) $$

where for any $t$

$$ \zeta(t) = [\Gamma(\eta_1, \ldots, \eta_{r-1},\, \zeta,\, \eta_{r+1}, \ldots, \eta_{n+1})](t) $$
Using a procedure similar to the one given in the proof of Theorem 3.1, we can show that the image sequence $\zeta$ in the above definition is well defined. Note that the sequence $\eta_r$ in the combined operator $I_{\eta_r}\, \Gamma(\eta_1, \ldots, \eta_{n+1})$ is a dummy sequence which is needed only to indicate the argument of $\Gamma$ to which the recursion is applied. In other words, the arguments of $I_{\eta_r} \Gamma$ are only $\eta_1, \ldots, \eta_{r-1}, \eta_{r+1}, \ldots, \eta_{n+1}$. With this definition, we can now prove the following theorem.

Theorem 3.2: For any causal operator $\Gamma : [\bar{R}_\delta]^{n+1} \to \bar{R}_\delta$, the solution of the recursive equation

$$ \zeta = \Gamma(\zeta,\, \xi_1, \ldots, \xi_n) $$

is given by

$$ \zeta = I_{\eta}\, \Gamma(\eta,\, \xi_1, \ldots, \xi_n) $$

Proof: From the definition of the iteration operator we obtain, for any $t \ge 1$,

$$
\begin{aligned}
\zeta(t) &= [I_{\eta}\, \Gamma(\eta,\, \xi_1, \ldots, \xi_n)](t) \\
&= [\Gamma(\zeta,\, \xi_1, \ldots, \xi_n)](t)
\end{aligned}
$$

which directly implies that $\zeta = \Gamma(\zeta,\, \xi_1, \ldots, \xi_n)$. ■

Theorem  3.2  provides  a  means  for  expressing  the  solution  of  recursive 
causal  equations.  Its  application  to  the  verification  of  systolic  networks, 
however,  depends  on  our  ability  to  manipulate  expressions  that  combine  the 
iteration  operator  and  other  sequence  operators.  The  following  theorem 
provides  the  basis  for  such  a  manipulation. 

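Theorem 3.3: If $\Delta : [\bar{R}_\delta]^{n+1} \to \bar{R}_\delta$ is a causal sequence operator, and $\Phi$ is any sequence operator with the property that

$$ \Delta(\Phi \zeta,\, \Phi \xi_1, \ldots, \Phi \xi_n) = \Phi\, \Delta'(\zeta,\, \xi_1, \ldots, \xi_n) \tag{3.41} $$

where $\Delta'$ may or may not be identical to $\Delta$, then

$$ I_{\eta}\, \Delta(\eta,\, \Phi \xi_1, \ldots, \Phi \xi_n) = \Phi\, I_{\eta}\, \Delta'(\eta,\, \xi_1, \ldots, \xi_n) \tag{3.42} $$

Proof: We write the right side of equation (3.42) as $\Phi \gamma$, where $\gamma$ is given by

$$ \gamma = I_{\eta}\, \Delta'(\eta,\, \xi_1, \ldots, \xi_n) $$

By Theorem 3.2, we know that $\gamma$ also satisfies

$$ \gamma = \Delta'(\gamma,\, \xi_1, \ldots, \xi_n); $$

that is

$$ \Phi \gamma = \Phi\, \Delta'(\gamma,\, \xi_1, \ldots, \xi_n) $$

By the hypothesis (3.41) this reduces to

$$ \Phi \gamma = \Delta(\Phi \gamma,\, \Phi \xi_1, \ldots, \Phi \xi_n) $$

which by Theorem 3.2 has the solution

$$ \Phi \gamma = I_{\eta}\, \Delta(\eta,\, \Phi \xi_1, \ldots, \Phi \xi_n) $$

Evidently, this is equal to the left side of equation (3.42). ■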

We next give some examples that illustrate the applications of the iteration operator to the verification of systolic networks.

3.4.3. The iteration operator in the verification of systolic networks.

In this section, we present two examples for the application of the iteration operator to the derivation of the I/O description of systolic networks that are modeled by mutually coupled systems of equations. The first network is the back substitution network that was verified in Section 3.2 for specific input sequences. Here, we will derive an explicit I/O description for this network and show that the iteration operator does simplify the verification of this network. The second example illustrates an important fact, namely that even though it is always possible to obtain analytical formulas for the I/O description of systolic networks, those formulas may sometimes be so long and cumbersome that they complicate the verification procedure rather than simplify it.

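Example 1:

Consider the back substitution network of Section 3.2. In that section, we did not obtain explicitly the network I/O description because of the coupled equations

$$ \rho_1 = \Omega\, [(\beta_0 - \sigma_0) / \alpha_0] \tag{3.43.a} $$

$$ \sigma_0 = \Omega^{k} \sigma_k \oplus {\sum_{j=1}^{k}}' \, \Omega^{j} [\alpha_j * \Omega^{j-1} \rho_1] \tag{3.43.b} $$

However, by substitution of (3.43.b) into (3.43.a) we may obtain

$$ \rho_1 = \Omega\, \Big[ \Big( \beta_0 - \Big[ \Omega^{k} \sigma_k \oplus {\sum_{j=1}^{k}}' \, \Omega^{j} [\alpha_j * \Omega^{j-1} \rho_1] \Big] \Big) / \alpha_0 \Big] $$

which by Theorem 3.2 has the solution

$$ \rho_1 = I_{\omega}\, \Omega\, \Big[ \Big( \beta_0 - \Big[ \Omega^{k} \sigma_k \oplus {\sum_{j=1}^{k}}' \, \Omega^{j} [\alpha_j * \Omega^{j-1} \omega] \Big] \Big) / \alpha_0 \Big] $$

and leads directly to the explicit I/O description

$$ \rho_{k+1} = \Omega^{k}\, I_{\omega}\, \Omega\, \Big[ \Big( \beta_0 - \Big[ \Omega^{k} \sigma_k \oplus {\sum_{j=1}^{k}}' \, \Omega^{j} [\alpha_j * \Omega^{j-1} \omega] \Big] \Big) / \alpha_0 \Big] \tag{3.44} $$

Equation (3.44) describes the output sequence $\rho_{k+1}$ in terms of the input sequences $\alpha_j$, $j = 0, \ldots, k$, $\beta_0$ and $\sigma_k$. In order to verify the network for the specific input described by equations (3.23), we substitute these sequences in (3.44). This provides

$$ \rho_{k+1} = \Omega^{k}\, I_{\omega}\, \Omega\, \Big[ \Big( \Omega^{k} \Theta \eta - \Big[ \Omega^{k} \Theta \iota \oplus {\sum_{j=1}^{k}}' \, \Omega^{j} [\Omega^{k+j} \Theta \lambda_j * \Omega^{j-1} \omega] \Big] \Big) / \Omega^{k} \Theta \lambda_0 \Big] $$

and now the application of Theorem 3.3 for factoring $\Omega^{k+1} \Theta$ from the operand of $I$ gives

$$
\begin{aligned}
\rho_{k+1} &= \Omega^{2k+1}\, \Theta\, I_{\omega} \Big[ \Big( \eta - \Big[ \iota \oplus {\sum_{j=1}^{k}}' \, \Omega^{j} [\lambda_j * \omega] \Big] \Big) / \lambda_0 \Big] \\
&= \Omega^{2k+1}\, \Theta\, \xi
\end{aligned}
$$

where

$$ \xi(t) = \Big( \eta(t) - \Big[ \iota \oplus {\sum_{j=1}^{k}}' \, \Omega^{j} [\lambda_j * \xi] \Big](t) \Big) / \lambda_0(t) $$

This leads directly to the expression for $\xi(t)$ derived in Section 3.2.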

Example 2:

In this example (Figure 3.4), we consider the special case $k = 3$ of the sorting network presented in Section 3.3. The operation of the network is modeled by the causal equations

$$ \sigma_2 = \Omega\, \pi_1 \tag{3.45.a} $$

$$ \sigma_3 = \Omega\, \min\nolimits_\delta(\pi_2,\, \sigma_2) \tag{3.45.b} $$

$$ \sigma_4 = \Omega\, \min\nolimits_\delta(\pi_3,\, \sigma_3) \tag{3.45.c} $$

$$ \pi_1 = \Omega\, \max\nolimits_\delta(\pi_2,\, \sigma_2) \tag{3.45.d} $$

$$ \pi_2 = \Omega\, \max\nolimits_\delta(\pi_3,\, \sigma_3) \tag{3.45.e} $$

Figure 3.4 - A special case of the sorting network.

In order to express the output sequence $\sigma_4$ in terms of the input sequence $\pi_3$, we start by solving (3.45.a/d) for $\sigma_2$. By Theorem 3.2, $\sigma_2$ is given by

$$ \sigma_2 = I_{\eta}\, \Omega^{2} \max\nolimits_\delta(\pi_2,\, \eta) \tag{3.46} $$

We then solve (3.46) and (3.45.b/e) for $\sigma_3$ and substitute the result in (3.45.c) to obtain the network I/O description in the form

$$ \sigma_4 = \Omega \min\nolimits_\delta\big(\pi_3,\; I_{\zeta}\, \Omega \min\nolimits_\delta\big(\Omega \max\nolimits_\delta(\pi_3, \zeta),\; I_{\eta}\, \Omega^{2} \max\nolimits_\delta(\Omega \max\nolimits_\delta(\pi_3, \zeta),\, \eta)\big)\big) \tag{3.47} $$

Although (3.47) describes explicitly the output in terms of the input, it may be very difficult to use it for the verification of the network for given inputs, especially if the size of the network $k$ is to be kept as a parameter. As a matter of fact, the non-systematic approach followed in Section 3.3 proved to be more effective for the formal verification of this sorting network.
4. COMPUTER SOLUTION OF SYSTEMS OF CAUSAL EQUATIONS.

It was shown in Section 3.4 that we can always obtain analytical solutions of systems of causal equations that model systolic networks. However, our tools for manipulating sequences are still limited and, in many cases, we are not able to derive the output sequences in a form that may be compared with the expected output of the network. In order to alleviate this problem, we describe in this chapter a computer system that was developed for solving iteratively any system of causal sequence equations for specifically given input sequences.

A simple language called SCE (Systems of Causal Equations) is used to provide the system of causal equations to the solver. It will be described in Section 1. The equation solver itself is a syntax-directed interpreter that executes any correct SCE program. This interpreter is outlined in Section 2. It reads the elements of the input sequences from an input file, and calculates the elements of the sequences on the left side of the equations specified by the program.

It should be noted that all data sequences considered here have infinite
length but contain only finitely many elements different from the don't care
element δ. Accordingly, an upper bound MAXT is assumed to be given by
the user for the maximal index of non-δ data items of all sequences. In
other words, elements beyond MAXT in any sequence are considered to be
equal to δ.
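The MAXT convention can be illustrated with a minimal sketch. It is not taken from the interpreter of Appendix C; the names DONT_CARE and pad_sequence are hypothetical, and the don't care element is represented here by Python's None purely for illustration.

```python
# Hypothetical stand-in for the don't care element (an assumption of this sketch).
DONT_CARE = None

def pad_sequence(items, maxt):
    """Return a list of exactly maxt elements.

    Positions beyond the supplied items are filled with DONT_CARE, which
    mirrors the rule that elements beyond MAXT (and unspecified ones) are
    treated as don't cares.
    """
    if len(items) > maxt:
        raise ValueError("sequence has more than MAXT elements")
    return list(items) + [DONT_CARE] * (maxt - len(items))

# A sequence with four specified elements, stored with MAXT = 6.
seq = pad_sequence([1.0, 2.5, DONT_CARE, 4.0], maxt=6)
```

A finite list of length MAXT is then enough to stand for the infinite sequence, since everything past the stored prefix is a don't care by convention.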

Although the generality of the solver allows it to be used for a wide
range of tasks, its immediate application will be to simulate computations on
systolic networks. For this, an SCE program is written which implements
the equations for the operation of the network, and then an input file is
created that contains the elements of the input data sequences. A run
of the SCE interpreter then provides the elements of the output data
sequences. This approach to the simulation of systolic networks separates
the internal details of the simulator from the concern of the user. It also
has the advantage of allowing the user to begin with a partial solution of
the system of equations modeling the network and then to use the SCE
interpreter on the portion of the system that could not be solved
analytically. As an example, we show in Section 3 how the operation of a
network for the LU decomposition of a symmetric banded matrix may be
simulated by means of the SCE interpreter.

4.1.  The  SCE  language  for  specifying  Systems  of  Causal  Equations. 

The SCE language is a simple expression language augmented with
some input/output facilities and looping capabilities that provide for
efficiency in the writing of programs. In its current form, SCE may be used
to model the operation of a special class of systolic networks in which the
units of information are real numbers. However, by the addition of new
rules to the grammar, it is possible to model other types of systolic
networks at higher or lower levels of architectures.

By the first rule in the grammar given in Appendix B, it is readily seen
that an SCE program consists of the following four parts: 1) the
declarations, 2) the input part, 3) the program body, and 4) the output
part. In the rest of this section, we will discuss the semantics of the
language.

Terminal symbols in SCE (see Appendix B) can be classified into four
categories, namely, special symbols (e.g. +, -, *, ...), reserved words
(e.g. FOR, OUT, ...), identifiers and constants, where a constant is either
a positive integer or a positive real number written in floating point
format.


In order to ensure a clear distinction between identifiers and reserved
words in SCE, we have chosen all the reserved words in the language to
start with capital letters. On the other hand, any string of alphanumeric
characters starting with a lower case letter can be used as an identifier,
with the understanding that only the first six characters are significant.
Identifiers in SCE should be declared in the declaration part of the program
to be of one of the following types:

1)  Parameter  (rules  3-7):  A  parameter  is  assigned  an  integer  value  at  the 
time  of  its  declaration  and  this  value  is  substituted  textually  whenever  the 
identifier  appears  in  the  program. 

2)  Index  (rules  8-11):  An  index  in  SCE  is  an  integer  variable  used  in  loop 
control  and  in  the  selection  of  elements  of  sequence  arrays. 

3)  Sequence  (rules  12-19):  Sequences  are  represented  on  the  machine  by 

vectors.  An  identifier  of  type  'sequence'  may  be  associated  with  either  a 
single  sequence  (rule  16)  or  with  an  array  of  sequences  (rule  15).  For 
arrays  of  sequences,  the  dimension  and  the  lower  and  upper  bounds  are 
specified  in  the  declaration  by  enclosing  these  bounds  in  curly  brackets. 

For example, the following SCE statement declares s as an n dimensional
sequence array

SEQN s {l_1:u_1, ..., l_n:u_n}

where $l_i$ and $u_i$, i=1,...,n, are the lower and upper bounds for the
i-th dimension. Bounds may be negative but should of course satisfy the
restriction that $u_i \ge l_i$. We also note that there is no limit on the
dimensionality of an array.

After the declaration of an array of sequences, its elements may be
identified (rules 38-41) by using the usual selection notation
$s(p_1, \dots, p_n)$, where each $p_i$ has to fall into its corresponding
range, that is

$l_i \le p_i \le u_i$.

In the context of the abstract systolic model, a data sequence is
associated with a communication link which is identified by its color and the
label of the node at which it terminates. In order to simplify the SCE
specification of a systolic network, we label the nodes in the network by
n-tuples of integers $(v_1, \dots, v_n)$ with some fixed n. This enables us
to group all the sequences associated with the links that have the same
color in an n-dimensional array. Of course the color of the link may be used
to identify the array. With this, the sequence associated with any link of
color $\gamma$ terminating at node $(v_1, \dots, v_n)$ is simply the element
$(v_1, \dots, v_n)$ of the sequence array $\gamma$.

Although this usually leads to a very clear SCE specification of a
systolic network, it is sometimes inefficient because some of the elements in
the array may not be used. For example, in the LU network described in
Section 3, a triangular array of sequences would be more space efficient
than the rectangular array allowed by SCE. In such cases, a more efficient
storage arrangement could be obtained by applying any one of the
techniques used for storing triangular and sparse matrices [18].

In addition to arrays of sequences, the language allows the user to
declare single sequences. Three standard single sequences are predefined
by the language, namely the don't care sequence, the zero sequence and
the unity sequence. The first two sequences were defined in Section 2.1.
The unity sequence r, as its name implies, is defined by r(t)=1.0 for
$1 \le t \le T(r)$ and arbitrarily large T(r). The don't care, zero and
unity sequences are denoted in SCE by the identifiers d, o and u,
respectively. The user, however, may re-declare the identifiers d, o or u
in order to change their definitions.
The input part of an SCE program has the form of a single INPUT
statement (rules 68-73). It serves two purposes: firstly, it assigns an
integer value to MAXT, which specifies the number of elements to be
considered in any sequence, and secondly, it specifies the sequences to be
read from the input file. Nested FOR loops can be used, to any level, in
specifying the input sequences. For example, in the program presented in
Section 3, the statement

INPUT (MAXT 18, FOR i = 0,3 r{i,1});

specifies that MAXT=18 and that the input sequences are r{0,1}, r{1,1},
r{2,1} and r{3,1}.
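How a FOR loop inside an INPUT statement selects sequences can be sketched as follows. This is only an illustration of the enumeration, not the interpreter's code; the function expand_input and its argument layout are hypothetical.

```python
from itertools import product

def expand_input(name, loops, subscripts):
    """List the (name, subscript tuple) pairs selected by nested FOR loops.

    loops      : dict mapping an index name to its inclusive (first, last) range,
                 outermost loop first
    subscripts : tuple whose entries are either index names or integer constants
    """
    names = list(loops)
    ranges = [range(first, last + 1) for first, last in loops.values()]
    selected = []
    for values in product(*ranges):          # leftmost index varies slowest
        env = dict(zip(names, values))
        selected.append((name, tuple(env.get(s, s) for s in subscripts)))
    return selected

# The loop "FOR i = 0,3" applied to the array r with second subscript fixed to 1.
selected = expand_input("r", {"i": (0, 3)}, ("i", 1))
```

With nested loops, passing several entries in loops enumerates the sequences in the order in which the loops would be executed.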

Similarly, the output part of the program takes the form of a single
statement (rules 74-78) that specifies the sequences to be printed on the
output file. Again, FOR loops are allowed.

The body of the SCE program is the part that contains the specification
of the system of sequence equations. It consists of a list of statements,
where a statement may be either a sequence equation or a FOR loop that
encloses a list of statements. Each equation has the form

sequence specification = sequence expression

where the left side identifies a particular sequence and the right side is an
expression composed of sequence identifiers and sequence operators. Square
brackets may be used in sequence expressions to override the precedence
rules defined by the grammar. Basically, in the evaluation of expressions,
the grammar associates the highest priority with the operators defined
directly on sequences. Next in priority is the scalar multiplication operator,
followed by the operators * and /. Finally, the operators + and - are
evaluated with the lowest priority. With these precedence rules the
sequence expressions are evaluated from left to right.

Although many other sequence operators may be incorporated into the
language, we only allowed for the following operators:

The positive shift operator $\Omega^r$, written in SCE as O{r},

The zero shift operator $\Omega_0^r$, written in SCE as Z{r},

The spread operator $\Theta^r$, written in SCE as T{r},

The expansion operator $E^r_k$, written in SCE as E{r,k},

The accumulator operator $A^{r,k,s}$, written in SCE as A{r,k,s}, and

The multiplexing operator $M_r^{w_1,\dots,w_n}$, written in SCE as
M{r,w1,...,wn}.

Frequently, the shift, zero shift and spread operators are used with r=1.
For this, the shorthand notation O, Z and T may be used instead of O{1},
Z{1} and T{1}, respectively.

The two element-wise operators U1 and U2 of rules 47 and 48 have
the same priority as the operators * and / but their semantics are not
specified by the language. As indicated in Section 3, U1 and U2 may be
defined by the user.

Finally, we note that rule 55 restricts the operands of the accumulator
operator to single sequences rather than sequence factors as is the case
with the other operators. This restriction is not necessary and was only
imposed because it leads to a more efficient implementation of the SCE
interpreter. However, it should be noted that this does not affect the
expressive power of the language because we can always define intermediate
sequences to get around this restriction. For example, the sequence
equation

x{i} = O{2} x{i-1} + A{1,3,1} [ y{i,1} * T x{i-1} ]

which is not permitted in SCE can be split into the two SCE legal
equations

v{i} = y{i,1} * T x{i-1}

x{i} = O{2} x{i-1} + A{1,3,1} v{i}.


4.2.  Overview  of  the  SCE  interpreter. 

As is the case with most language interpreters, the SCE solver has two
distinct phases, namely, the syntax analysis phase which, using a parse tree
of the program, produces an intermediate language program, and the actual
interpretation phase which executes the intermediate program.

For the syntax analysis phase, we used the automatic parser generator
YACC [25] existing on the UNIX operating system to generate an LR(1)
bottom-up parser that accepts any syntactically correct SCE program and
generates an intermediate program in the form of a sequence of tuples. The
parser is basically a finite state machine with a stack. It scans the input
program from left to right and is capable of reading and remembering the
next input token (terminal symbol), which is called the look-ahead token.
Depending on the look-ahead token and the content of the stack, the parser
takes one of the following actions:

1) Shift: The current look-ahead token is pushed onto the stack and the
next token is read in. Also a tuple describing the action is generated. If
the token being shifted is a special symbol, an identifier, or a reserved
word, the tuple generated has the form (Shift, n), where n is a number
identifying the token. On the other hand, if the token is an integer or a
real constant, then the tuple generated has the form (Shift-integer, c) or
(Shift-real, r), respectively, where c or r is the value of the constant.

2) Reduce: This action is taken when the parser recognizes that the stack
contains the right hand side of a grammar rule, say rule n, and that this
rule should be applied at this point. It then pops from the stack the
tokens forming the right side of rule n and pushes onto the stack the token
on its left side. It also generates a tuple (Reduce, n).

3) Accept: This action is taken when the parsing process is successfully
completed. The tuple generated in this case is (Accept, 0).

4) Error: If the parser discovers that the program is syntactically incorrect,
it simply gives a warning and halts. Of course, more elaborate error
handling actions could have been taken if our goal was to produce a more
sophisticated parser. For the details and internal forms of LR parsers, we
refer to [2].

The second phase of the interpreter reads the intermediate program
(sequence of tuples) and reproduces the actions taken by the parser on an
action stack. Simultaneously, an adjoint value stack is used to store
temporary values needed in the interpretation. In Appendix C, we give the
complete listing of a C program for the second phase of our SCE solver,
and in the rest of this section we will outline the main features of this
solver/interpreter.

The program uses a location counter "location" to indicate the
intermediate tuple being interpreted. Starting with location=1, the
interpreter reads the tuple pointed to by "location", takes a certain action,
increases location by one and then repeats the above cycle. The action taken
in each cycle depends on the type of the tuple being interpreted:

1) If the tuple is of the type (Shift, n) or (Shift-integer, n), then n is
pushed onto the action stack.

2) If the tuple is of the type (Shift-real, r), then r is pushed onto the
value stack and a zero is pushed onto the action stack.

3) If the tuple is of the type (Accept, 0), then the interpretation is
terminated.

4) If the tuple is of the type (Reduce, n), where grammar rule n has the
form b → a1 a2 ... ak, then the interpreter pops the k top locations of
the stack, which should contain the symbols a1, ..., ak, and pushes b. It
also may execute a semantic routine if any is associated with this grammar
rule. These routines manipulate the data on the value stack to reflect the
semantics of the grammar rule.
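The four cases above can be sketched as a small tuple interpreter. This is an illustrative reconstruction and not the C program of Appendix C: the toy grammar (rules 1 to 3), the token number 43 and the routine names are assumptions made only for this example, and the location counter is 0-based here rather than starting at 1.

```python
def interpret(tuples, rules):
    """Replay parser actions; rules maps n -> (lhs, k, semantic routine or None)."""
    action, values = [], []
    location = 0                      # 0-based here; the text starts at 1
    while True:
        kind, arg = tuples[location]
        if kind in ("Shift", "Shift-integer"):
            action.append(arg)
        elif kind == "Shift-real":
            values.append(arg)        # the constant goes on the value stack
            action.append(0)          # a zero marks its place on the action stack
        elif kind == "Accept":
            return values
        elif kind == "Reduce":
            lhs, k, routine = rules[arg]
            del action[len(action) - k:]   # pop the k symbols of the right side
            action.append(lhs)             # push the left side symbol
            if routine is not None:
                routine(values)
        else:
            raise ValueError("unknown tuple type: %r" % (kind,))
        location += 1

def add(values):
    """Semantic routine for the hypothetical rule expr -> expr + term."""
    right = values.pop()
    left = values.pop()
    values.append(left + right)

# Hypothetical toy grammar: 1: expr -> expr + term, 2: expr -> term, 3: term -> real
rules = {1: ("expr", 3, add), 2: ("expr", 1, None), 3: ("term", 1, None)}
program = [("Shift-real", 2.0), ("Reduce", 3), ("Reduce", 2),
           ("Shift", 43), ("Shift-real", 3.0), ("Reduce", 3),
           ("Reduce", 1), ("Accept", 0)]
result = interpret(program, rules)
```

The point of the sketch is that the second phase needs no parsing tables: the tuple stream already encodes every decision the parser made, so replaying it with semantic routines attached to reductions is enough.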

At this point we note that we do not actually have to push or pop the
grammar symbols in the action stack, and that it suffices to keep track of
the top of the stack. Then at the time of a reduction, the grammar rule
and the top of the stack determine uniquely the location that each symbol
would occupy on the stack. With this, we can transmit information from
one semantic routine to another by pushing this information onto the stack
in the place of the grammar symbol. For example, the semantic routine
associated with rule 36 uses the location of the FOR symbol on the stack
to store the starting address of the first statement in the FOR loop body.
Then, when the execution of the FOR body is terminated, the routine for
rule 35 retrieves this address from the stack to re-initiate the execution of
the FOR body if the final value of the index of the loop is not yet reached.

In order to reduce the storage required for holding the intermediate
tuples, the program in Appendix C reads and executes the tuples in four
stages: in the first and the second stage, the declaration and the input part
of the program are processed, respectively. In the third stage, the system
of equations is solved, and finally in the fourth stage the output is printed.
We briefly comment on each stage.

The declarations: The main objective of this stage is to construct the
symbol table and to allocate storage for the declared sequences. The symbol
table sym_tab[] is an array of records with three fields. The first field
contains a character that indicates the type given to the identifier, namely
P, I or S for parameters, indices or sequences, respectively. The
interpretation of the integers in the second and third fields, called entry1
and entry2, depends on the type of the identifier. For parameters, entry1
contains the value of the parameter and entry2 is immaterial. For index
variables, entry1 holds the value of the index, initialized to zero, and
entry2 is set during the execution of a FOR loop to the final value of the
loop index.

Finally, if the identifier is declared as a sequence variable, then it may
denote a single sequence or an array of sequences. Single sequences are
distinguished by setting entry2=-1, with entry1 pointing to the location where
the sequence is stored. For arrays of sequences, entry2 holds a pointer to
a bound table that indicates the dimension and the bounds of each array,
and entry1 points to the location where the first sequence in the array is
stored. The first three locations in the symbol table are reserved for the
identifiers d, o and u, which are preset to the don't care, the zero and
the unity sequences, respectively. However, if any of these identifiers
are declared in the program, then the corresponding entry in the
symbol table is overwritten by the semantic routine corresponding to the
new declaration.
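The record layout just described can be sketched as follows. The field names entry1 and entry2 and the type letters come from the text; the class name Symbol, the helper describe and the concrete row numbers 0, 1 and 2 for d, o and u are assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Symbol:
    kind: str        # 'P' parameter, 'I' index, 'S' sequence
    entry1: int      # value (P, I) or row where the (first) sequence is stored (S)
    entry2: int = 0  # final loop value (I), -1 or bound-table pointer (S)

def describe(sym):
    """Interpret entry1 and entry2 according to the type of the identifier."""
    if sym.kind == "P":
        return "parameter with value %d" % sym.entry1
    if sym.kind == "I":
        return "index with value %d and final value %d" % (sym.entry1, sym.entry2)
    if sym.kind == "S" and sym.entry2 == -1:
        return "single sequence stored in row %d" % sym.entry1
    if sym.kind == "S":
        return "sequence array from row %d, bounds at %d" % (sym.entry1, sym.entry2)
    raise ValueError("unknown identifier type")

# The first three entries are reserved for d, o and u (row numbers are assumed).
sym_tab = {"d": Symbol("S", 0, -1), "o": Symbol("S", 1, -1), "u": Symbol("S", 2, -1)}
sym_tab["k"] = Symbol("P", 3)        # a parameter with value 3
sym_tab["i"] = Symbol("I", 0, 0)     # an index, initialized to zero
sym_tab["r"] = Symbol("S", 3, 0)     # an array whose bounds sit at bound-table slot 0
```

Re-declaring d, o or u simply replaces the corresponding entry, which matches the overwriting behaviour described above.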

The sequences are stored in a two dimensional array seq_store[][].
Each row in the array has a length at least equal to MAXT and is used to
store the elements of a sequence. Arrays of sequences are stored in
consecutive rows such that any index changes slower than the one to its
right, if any. In order to keep track of don't care elements, an array
d-table[][] of bits is used such that for each element in seq_store[][]
there is a corresponding bit in d-table[][]. This bit is set to one if the
element in seq_store[][] is a don't care, and to zero otherwise. Thus, any
part of the program that reads an element from seq_store[][] has to inspect
also the corresponding entry in d-table[][].
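The row ordering described above (each index changing slower than the one to its right) is the usual row-major layout. A minimal sketch of the row computation, with a hypothetical function name row_offset and bounds given as inclusive (lower, upper) pairs, might look like this:

```python
def row_offset(bounds, subscripts):
    """Row of element `subscripts` relative to the first row of its array.

    bounds     : list of inclusive (lower, upper) pairs, one per dimension
    subscripts : one integer per dimension; the rightmost index varies fastest
    """
    if len(bounds) != len(subscripts):
        raise IndexError("wrong number of subscripts")
    offset = 0
    for (lower, upper), p in zip(bounds, subscripts):
        if not lower <= p <= upper:
            raise IndexError("subscript %d outside %d:%d" % (p, lower, upper))
        offset = offset * (upper - lower + 1) + (p - lower)
    return offset

# A two dimensional array with bounds 0:3 and 1:2 occupies 8 consecutive rows.
bounds = [(0, 3), (1, 2)]
first = row_offset(bounds, (0, 1))
last = row_offset(bounds, (3, 2))
```

Adding this offset to the entry1 field of the array gives the row of seq_store[][] (and of the parallel don't care table) that holds the selected sequence.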

The implementation of the SCE solver listed in Appendix C allows for
full causality in the sense that an output may depend on any previous input.
Accordingly, storage is provided for the retention of at least MAXT elements
from any sequence. A more space-efficient implementation would be to
retain only the last C elements from each sequence in a circular buffer,
where C is given. This allows only for C-order causality in the sense that
the output of a certain cell at any given time t may only depend on the
inputs to that cell during the time period from t-C to t-1.
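The circular buffer alternative is not implemented in Appendix C; the following is only a sketch of how it could work, with a hypothetical class name RecentHistory and 1-based time indices as in the text.

```python
class RecentHistory:
    """Keep only the last C elements of one sequence in a circular buffer."""

    def __init__(self, c):
        self.c = c
        self.buf = [None] * c
        self.latest = 0            # time index of the most recently stored element

    def store(self, t, value):
        if t != self.latest + 1:
            raise ValueError("elements must be stored in time order")
        self.buf[t % self.c] = value   # overwrites the element of time t - C
        self.latest = t

    def read(self, t):
        # Only the C most recent elements are still available.
        if t < 1 or not (self.latest - self.c < t <= self.latest):
            raise IndexError("element %d is no longer (or not yet) stored" % t)
        return self.buf[t % self.c]

history = RecentHistory(3)
for t in range(1, 6):
    history.store(t, 10.0 * t)
```

After the five stores above, only the elements for t = 3, 4 and 5 can still be read, so an equation evaluated at time 6 may reach back at most C = 3 steps.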

The input part: The INPUT statement specifies the sequences to be read
from the input file as well as the number MAXT of elements in each
sequence. The interpreter reads, as a stream, the MAXT elements of the
first specified sequence followed by those of the second sequence, etc.,
provided that the elements are separated by at least one space. No
special characters are required to separate the elements of the different
sequences. Each element in the input file may be either a floating point
number or the letter "d" representing a don't care element. The interpreter
also recognizes the string "..." in the input file as an indication that the
remaining elements in that sequence are don't cares.
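A reader for this input format can be sketched in a few lines. The function read_sequences is hypothetical, the don't care element is again represented by None, and the exact spelling of the "remaining elements" marker is assumed to be three periods.

```python
DONT_CARE = None   # stand-in for the don't care element (assumption)

def read_sequences(text, count, maxt):
    """Read `count` sequences of `maxt` elements each from a whitespace stream."""
    tokens = iter(text.split())
    sequences = []
    for _ in range(count):
        seq = []
        while len(seq) < maxt:
            tok = next(tokens)
            if tok == "...":                    # rest of this sequence is don't care
                seq.extend([DONT_CARE] * (maxt - len(seq)))
            elif tok == "d":                    # a single don't care element
                seq.append(DONT_CARE)
            else:
                seq.append(float(tok))
        sequences.append(seq)
    return sequences

# Two sequences with MAXT = 4; no separator is needed between them.
data = read_sequences("1.0 d 2.5 ... 3 4 5 6", count=2, maxt=4)
```

Because every sequence has exactly MAXT elements (explicitly or through the marker), the boundaries between sequences are implied by counting alone.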

The equation solver: The sequence of tuples in the body of the program
is executed iteratively MAXT times. A global clock "TIME" is initialized to
1 and incremented at every step of the iteration. At every step, the
expression on the right side of each equation is evaluated at time TIME and
assigned to the corresponding element of the sequence on the left side of
the equation.

The value stack is used during the evaluation of sequence expressions
to store temporary results. For example, the semantic routine associated
with the grammar rule 52 (seq_factor → seq_spec) reads the value of the
element TIME in the specified sequence from seq_store[][], and pushes it
onto the value stack. The result of any subsequent sequence operation is
stored on the stack until rule 37 is executed and the final result is stored
back into seq_store[][].

The sequence operators O, Z, T and E operate on sequence factors
and have the effect of changing the global clock during the evaluation of
the corresponding factor. The old clock value is stored in the action stack
for retrieval after the evaluation of the factor is complete (rule 53). If the
result of any operation involving the above operators is the don't care
element, then the flag "skip" is set, which causes the execution of the
semantic routines to be skipped until the corresponding tuple (Reduce, 53)
is encountered. Of course provisions are made to deal with arbitrary degrees
of nesting.

In a similar way, the flag "Mskip" is used to choose the appropriate
operand in the multiplexer operator. Finally, we note that by restricting the
operands of the accumulator operator to sequences instead of sequence
factors, we greatly simplified the action associated with that operator. For
a detailed description of the different semantic routines, we refer to the
complete listing of the program in Appendix C.

It is important to note that the SCE interpreter detects any
inconsistency in the given equations or any attempt to solve equations which
are not causal or weakly causal. It does so by associating with each
sequence an entry in the array last_computed[] to keep track of the last
element that has been computed in the sequence so far. Any attempt to
overwrite an already calculated element or to read an element that has not
yet been calculated is then easily detected and reported as a run time
error. The interpreter also detects other types of run-time errors that are
listed in the function run_error in the Appendix.
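The iteration over TIME together with the last_computed[] check can be sketched as below. This is a simplified illustration, not the Appendix C program: equations are given as Python functions instead of tuples, don't care handling is omitted, and treating references to times before 1 as zero is an assumption of the sketch.

```python
class CausalityError(RuntimeError):
    """Raised when an element is read before it is computed, or written twice."""

def solve(inputs, equations, maxt):
    """inputs: name -> list of maxt values; equations: list of (name, rhs(read, t))."""
    store = {name: [None] + list(vals) for name, vals in inputs.items()}  # 1-based
    last_computed = {name: maxt for name in inputs}
    for name, _ in equations:
        store[name] = [None] * (maxt + 1)
        last_computed[name] = 0

    def read(name, t):
        if t < 1:
            return 0.0                      # assumption: nothing before time 1
        if t > last_computed[name]:
            raise CausalityError("%s(%d) has not been computed yet" % (name, t))
        return store[name][t]

    for time in range(1, maxt + 1):         # the global clock TIME
        for name, rhs in equations:
            if last_computed[name] >= time:
                raise CausalityError("%s(%d) would be overwritten" % (name, time))
            store[name][time] = rhs(read, time)
            last_computed[name] = time
    return {name: store[name][1:] for name, _ in equations}

# A causal recursion: y(t) = x(t) + y(t-1), i.e. a running sum of x.
running = solve({"x": [1.0, 2.0, 3.0, 4.0]},
                [("y", lambda read, t: read("x", t) + read("y", t - 1))], 4)

# A non-causal equation, y(t) = y(t+1), is rejected at the first step.
try:
    solve({}, [("y", lambda read, t: read("y", t + 1))], 4)
    rejected = False
except CausalityError:
    rejected = True
```

A single counter per sequence suffices for both checks because elements are always produced in increasing time order.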

The output part: After completing the interpretation of the body of the
program, the sequences specified by the user are printed on the standard
output file.

The SCE simulator/solver was used to simulate the operation of the
systolic networks that have been verified analytically in Chapters 3 and 7.
In the next section, we will illustrate its application by applying it to the
simulation of an LU factorization network that will be used in Chapter 9.
Although this approach to the simulation of systolic networks is very simple,
it should be clear that it can be used only to verify instances of
computations; that is, all architectural parameters and input data have to be
given specific values during the simulation. Of course, this observation
applies to any simulator or numeric solver, and the only way to allow for a
more general simulator would be to consider symbolic manipulation which
produces a symbolic description of the outputs in terms of the inputs.

Finally, we note that a possible optimization of the implementation of
the interpreter could be achieved by replacing the single value stack by k
stacks, for some optimal k. This would reduce the total number of
iterations through the body of the program by considering at each step the
elements TIME, TIME+1, ..., TIME+k of the sequences instead of only one
element at a time. However, if the system contains any recursion, then
only a few of these k elements (and in many cases only one) can be
considered at each step, and this requires more complicated bookkeeping to
update the array last_computed[] and the global clock. We decided not to
implement this optimization because we intended the solver to be used in
cases where analytical solutions of the system of equations are difficult, and
hence where recursion is usually present.

4.3.  Example:  An  LU  factorization  network. 

In this section the SCE interpreter is applied to the simulation of the
computations of a network for the LU or the $U^T D U$ factorization of a
symmetric banded matrix A and the solution of the linear system of equations
Ax=y with a given vector y. It will be shown also that, with slight
modifications, the same network can be used to compute the Cholesky
decomposition $L L^T$ of the matrix A.

The first systolic network for factoring a banded matrix into the product
of a lower triangular matrix and an upper triangular matrix was suggested
by Kung and Leiserson [31]. Later, Brent and Luk [7] modified the Kung
and Leiserson network to compute the Cholesky decomposition of symmetric
matrices. The network described in this section is also designed for
symmetric matrices but is different in its operation principle from the one
given in [7]. Both networks use almost the same number of computational
cells and achieve approximately the same speed-up over serial execution.
They differ, however, in the type of computational cells and in their
interconnections.

Consider the system of linear equations

A x = y (4.1)

where A is an n × n matrix and x and y are n dimensional vectors. The
solution x of (4.1) may be obtained by finding a lower triangular matrix L
and an upper triangular matrix U such that A = L U, and by solving the
two triangular systems L z = y and U x = z. More specifically, assuming
that A is symmetric and banded with band width 2k+1, and denoting the
elements of the matrices A, L and U by $a_{i,j}$, $l_{i,j}$ and $u_{i,j}$,
respectively, we may use the following algorithm to compute the LU
decomposition of A. Note that only the elements $a_{i,j}$ with $j \ge i$ are
used and that only the nonzero elements of L and U are computed.

ALG2 : LU factorization.

FOR i = 1, ..., n DO

   1) FOR j = i, ..., min(n, i+k) DO

         1.1) $l_{j,i} = a_{i,j} / a_{i,i}$
         1.2) $u_{i,j} = a_{i,j}$

   2) FOR q = i+1, ..., min(n, i+k) DO

         FOR j = q, ..., min(n, i+k) DO

            $a_{q,j} = a_{q,j} - l_{q,i} \, u_{i,j}$

At this point we note that the matrix L obtained by the above algorithm
satisfies $L = U^T D$, where D is the diagonal matrix defined by
$d_{i,i} = 1/u_{i,i}$. Also, by replacing steps 1.1 and 1.2 in ALG2 by

$l_{j,i} = u_{i,j} = a_{i,j} / \sqrt{a_{i,i}}$

we obtain an algorithm for the Cholesky decomposition $L L^T$ of A.
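The factorization of a symmetric banded matrix can be sketched in serial form as follows. This is a plain Python illustration under stated assumptions, not the network itself: indices are 0-based, the matrix is held as a full list of lists although only the band in the upper triangle is read, and the function name lu_symmetric_banded is hypothetical. The scaling of the columns of L by the pivot follows the relation between L, U and the diagonal matrix D noted above.

```python
def lu_symmetric_banded(a, k):
    """Return (L, U) with A = L U for a symmetric matrix of band width 2k+1.

    Only the entries a[i][j] with i <= j <= i + k are read; L has a unit diagonal.
    """
    n = len(a)
    a = [row[:] for row in a]                   # work on a copy
    l = [[0.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        last = min(n - 1, i + k)                # min(n, i+k) in 1-based terms
        for j in range(i, last + 1):
            l[j][i] = a[i][j] / a[i][i]         # column i of L from row i of A
            u[i][j] = a[i][j]                   # row i of U
        for q in range(i + 1, last + 1):        # subtract the contribution of step i
            for j in range(q, last + 1):
                a[q][j] -= l[q][i] * u[i][j]
    return l, u

# A small tridiagonal example (k = 1).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
L, U = lu_symmetric_banded(A, 1)
```

No pivoting is performed, so the sketch assumes that all pivots a[i][i] are nonzero, as is the case for positive definite matrices.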

After having performed the LU decomposition of A, we may compute the
vector $z = L^{-1} y$ by the following algorithm

ALG3  :  Back  substitution. 


FOR  i=l.  •  •  •  .n  DO 


h 


FOR  q=i+l.  •••  .mintn,  i+k)  DO 


Finally, the solution of Ux=z may be obtained by an algorithm similar
to ALG3, or alternatively by using ALG3 itself for the solution of
$\bar{U} \bar{x} = \bar{z}$, where $\bar{z}_i = z_{n-i+1}$,
$\bar{x}_i = x_{n-i+1}$ and $\bar{U}$ is a banded lower triangular matrix
with the elements $\bar{u}_{i,j} = u_{n-i+1,n-j+1}$.
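The reuse of a lower triangular solver for the upper triangular system can be sketched as follows. The sketch is not a transcription of ALG3: it is a standard column oriented substitution written under the assumption that the solver divides by the diagonal element, which is a no-op for a unit diagonal L and is what makes the reversed system solvable with the same routine. The function names and the 0-based indices are hypothetical.

```python
def forward_substitute(l, y, k):
    """Solve L z = y for a banded lower triangular L (k subdiagonals)."""
    n = len(y)
    y = list(y)                                  # work on a copy
    z = [0.0] * n
    for i in range(n):
        z[i] = y[i] / l[i][i]                    # assumption: divide by the diagonal
        for q in range(i + 1, min(n - 1, i + k) + 1):
            y[q] -= l[q][i] * z[i]               # remove the contribution of z[i]
    return z

def solve_upper_by_reversal(u, z, k):
    """Solve U x = z by reversing the order of unknowns and equations."""
    n = len(z)
    u_bar = [[u[n - 1 - i][n - 1 - j] for j in range(n)] for i in range(n)]
    z_bar = z[::-1]
    x_bar = forward_substitute(u_bar, z_bar, k)  # u_bar is lower triangular
    return x_bar[::-1]

U = [[4.0, 1.0, 0.0],
     [0.0, 2.0, 1.0],
     [0.0, 0.0, 5.0]]
x = solve_upper_by_reversal(U, [6.0, 7.0, 15.0], 1)

L = [[1.0, 0.0, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.25, 1.0]]
z = forward_substitute(L, [2.0, 2.0, 4.25], 1)
```

Reversing both the rows and the columns turns the banded upper triangular matrix into a banded lower triangular one with the same band width, so no separate back substitution routine is needed.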
The graph of the systolic network that executes ALG2 and ALG3
simultaneously is shown in Figure 4.1. It is composed of (k+1)(k+4)/2
interior nodes. Each node is labeled by a pair (i,j), where i and j are the
coordinates of the node with respect to the two axes shown in the figure. The
color of each edge is determined by its direction. More specifically, edges
directed to the east, south and south west are given the colors s, b and c,
respectively, and those directed north are assigned the colors r or p
depending on their relative position.


The part of the graph that is formed by nodes (i,j), i=1,...,k+1,
j=1,...,k-i+2 represents a subnetwork that executes ALG2. It consists of
three types of nodes whose operation is described by the following
equations.


Figure  4.1  -  The  graph  for  the  LU  network 


For node (k+1,1)

°k.  1  =  n0  tr  y  Ipk  +  1.1  "  YA+1,131 


(4.5) 


where r is the unity sequence defined by r(t)=1.0 for any t.
For nodes (i,1), i=1,...,k


p/.2  =  n0  Ip/,1  '  yi. I1 

vi.2  s  n0  lai^  '  lp/.l  ’  VlJ]  =  n0  °i.  1  *  P/.2 


(4  6. a) 
(4. 6  b) 
(4.6.C) 


For nodes (i,j), j=2,...,k+1, i=1,...,k+2-j


J.  ,  .  = 

o.  . 

(4. 7. a) 

l-hl 

0  i  .1 

*/.;+ 1  = 

%  * i.i 

(4. 7  b) 

p/./+i  = 

n0  Pl.l 

(4. 7.c) 

>/ti./-i 

*  no  lyu  *  “u  '  °u‘ 

(4.7.d) 

On the other hand, the part of the graph composed of the nodes (0,j),
j=1,...,k+1 corresponds to a subnetwork that executes ALG3. The
operations of the cells in this subnetwork are described by

For node (0,1)

p0,2  =  n0  ^p0,l  "  ^o.l*  (4. 8. a) 

^0.0  ”  n0  fa0,1  *  fp0.1  ”  *0.l”  =  n0  a0.1  *  p0.2  (4. 8.0) 

For nodes (0,j), j=2,...,k+1

P0./  +  1  =  ng  P0./  (4.9.a) 

*0./-l  =  n0  W0 ./  +  O0  l  •  p0  .]  (4.9.0) 

Note that the nodes (i,1), i=0,...,k, correspond to subtract/multiply
cells, while the nodes (i,j), j=2,...,k+1, i=1,...,k+2-j, are multiply/add
cells. Only the node (k+1,1) is a subtract/divide cell. In other words, the
network is composed of three basic types of simple computational cells.

For the proper operation of the network, the input sequences $\gamma_{1,j}$,
j=1,...,k+1, and $\beta_{0,k+1}$ are set to the zero sequence $\iota$, and the
input links $s_{k+2-j,j}$, j=2,...,k+1, are connected to the links
$p_{k+2-j,j}$ (see figure), that is

    $\gamma_{1,j} = \iota$,    j = 1,...,k+1    (4.10.a)
    $\beta_{0,k+1} = \iota$    (4.10.b)
    $\sigma_{k+2-j,j} = \pi_{k+2-j,j}$    (4.10.c)


The elements of the matrix A and the vector y are fed into the network
through the links $r_{i,1}$, i=1,...,k+1, and $r_{0,1}$, respectively. The
precise input specification is given by

    $\rho_{i,1} = \Omega^{k+1-i} \Theta^2 \alpha_i$,    i = 1,...,k+1    (4.11.a)
    $\rho_{0,1} = \Omega_0^{k+1} \Theta^2 \eta$    (4.11.b)

where $T(\alpha_i) = n-(k+1-i)$, $T(\eta) = n$ and

    $\alpha_i(t) = a_{t,t+k+1-i}$,    $\eta(t) = y_t$.

In other words, $\alpha_{k+1-q}$ contains the n-q elements of the q-th off
diagonal of A, and $\eta$ the n elements of the right hand side vector y.
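The input pattern (4.11.a/b) can be made concrete with a small Python sketch.
It is only an illustration under stated assumptions: the helper names
`spread`, `shift` and `network_inputs` are hypothetical, `None` stands for the
don't-care element that the SCE files print as d, and the band data are the
values that appear in the input file listed later in this section (n=5, k=2).

```python
DELTA = None  # assumed stand-in for the don't-care element (printed as d)

def spread(seq, gap):
    """Insert `gap` don't-care elements between consecutive elements."""
    out = []
    for idx, x in enumerate(seq):
        if idx > 0:
            out.extend([DELTA] * gap)
        out.append(x)
    return out

def shift(seq, r, fill=DELTA):
    """Delay a sequence by r time units; the first r slots hold `fill`."""
    return [fill] * r + list(seq)

def network_inputs(diagonals, y, k):
    """Build the sequences on r(i,1), i=0..k+1, following (4.11.a/b).

    diagonals[q] lists the q-th off diagonal of A (q=0 is the main diagonal).
    """
    inputs = {}
    for i in range(1, k + 2):
        q = k + 1 - i                       # alpha_i carries the q-th off diagonal
        inputs[i] = shift(spread(diagonals[q], 2), k + 1 - i)
    # The right hand side enters through r(0,1) behind a zero shift of k+1.
    inputs[0] = shift(spread(y, 2), k + 1, fill=0.0)
    return inputs

diagonals = {0: [2.0, 11.0, 20.0, -19.0, 14.0],
             1: [4.0, 15.0, 2.0, 1.0],
             2: [6.0, -3.0, -2.0]}
y = [4.0, 14.0, 15.0, 0.0, -6.0]
inputs = network_inputs(diagonals, y, k=2)
```

With these data, `inputs[0]` to `inputs[3]` start with the same elements as the
four lines of the input file used in the simulation, up to trailing don't-care
padding.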
In order to understand the principle of operation of the network, we first
note that at iteration step i of algorithm ALG2, the ith row of U and the ith
column of L are computed from the ith row of A (steps 1.1 and 1.2). However,
the elements of the matrix A are continuously modified. In particular, at the
execution of step i, the elements in row i of A had been modified by
subtracting from them different contributions (step 2) during the steps
i-k,...,i-1. In the systolic network of Figure 4.1, the elements of the
unmodified ith row of A arrive at the cells (q,1), q=1,...,k+1, on the r
colored links. At the same time, the sum of the contributions from the
previous iterations i-k,...,i-1 arrives at the same cells on the c colored
links. The subtraction is then performed and the elements of the
corresponding column and row of L and U are computed and sent out on the r
and p colored links, respectively. These elements propagate upward in the
network, allowing the cells (q,j), j=2,...,k+1, q=1,...,k+2-j, to compute and
sum the contributions for the modification of the subsequent rows of A. These
contributions are sent downward on the c colored links. The subnetwork formed
by the cells (0,j), j=1,...,k+1, operates in a similar way.


A closer study of the behavior of the network shows that the significant
elements of the matrix U are sampled from the links $p_{q,2}$, q=1,...,k, and
the elements of the partial solution vector z from the link $b_{0,0}$. These
results are sufficient for the computation of the solution $x = U^{-1} z$.
However, the elements of the diagonal matrix D, where $L = U^T D$, are also
available on the link $s_{k,1}$. More precisely, the output sequences are
expected to have the following description:


q  =  1 ,  •  •  •  ,k 


(4.12.a)
(4.12.b)
(4.12.c)


where $T(\mu_q) = n-(k+1-q)$, $T(\zeta) = T(\lambda) = n$ and

    $\mu_q(t) = u_{t,t+k+1-q}$,    $\lambda(t) = d_{t,t}$,    $\zeta(t) = z_t$.

After the computation of U and z terminates, we may use the network for a
second time to solve Ux=z and obtain the vector x. Of course, only a few
cells will be doing useful work during this second run. At this point we note
that we can add to the network any number of columns of cells identical to
the column (0,j), j=1,...,k+1. This enables us to use the network to solve
(4.1) for more than one right hand side vector y simultaneously.


Finally, we note that the network described in this section can be modified
to perform the Cholesky decomposition $LL^T$ instead of the $U^TDU$
decomposition. For this, the equation (4.5) for the operation of node (k+1,1)
has to be replaced by

    $\sigma_{k,1} = \Omega_0 [\mathbf{1} / (\rho_{k+1,1} - \gamma_{k+1,1})^{1/2}]$

and the data on the links $r_{i,2}$, i=1,...,k, have to be set equal to the
data on $p_{i,2}$. This has the effect of modifying (4.7.d) to

    $\gamma_{i+1,j-1} = \Omega_0 [\gamma_{i,j} + \pi_{i,j} * \sigma_{i,j}]$

It is clear that in this case, the links $r_{i,j}$, j=2,...,k, i=1,...,k+2-j,
carry redundant information and hence can be removed from the network.


After this description of the network, we turn our attention to the task of
simulating its operation. First of all, we write an SCE program that
describes the network and contains the equations that model its nodes. In the
following program, the parameter k, which determines the size of the network,
is set to 2.


The SCE program for the network of Figure 4.1

    PAR k=2 ;
    INDEX i,j ;
    SEQN s(0:k,1:k+1) ,
         r(0:k+1,1:k+2) ,
         p(1:k,1:k+2) ,
         c(1:k+1,1:k+1) ,
         b(0:0,0:k+1) ;

    INPUT( MAXT 18 , FOR i=0,k+1 r(i,1) ) ;            /* input statement    */
    FOR j=1,k+1 DO c(1,j) = o END ;                    /* equation (4.10.a)  */
    b(0,k+1) = o ;                                     /* equation (4.10.b)  */
    s(k,1) = Z [ u / [ r(k+1,1) - c(k+1,1) ] ] ;       /* equation (4.5)     */
    FOR i=1,k DO
       r(i,2) = Z [ r(i,1) - c(i,1) ] ;                /* equation (4.6.a)   */
       p(i,2) = Z s(i,1) * r(i,2) ;                    /* equation (4.6.b)   */
       s(i-1,1) = Z s(i,1)                             /* equation (4.6.c)   */
    END ;
    FOR j=2,k+1 DO
       FOR i=1,k+2-j DO
          s(i-1,j) = Z s(i,j) ;                        /* equation (4.7.a)   */
          p(i,j+1) = Z p(i,j) ;                        /* equation (4.7.b)   */
          r(i,j+1) = Z r(i,j) ;                        /* equation (4.7.c)   */
          c(i+1,j-1) = Z [ c(i,j) + r(i,j) * s(i,j) ]  /* equation (4.7.d)   */
       END ;
       s(k+2-j,j) = p(k+2-j,j)                         /* equation (4.10.c)  */
    END ;
    r(0,2) = Z [ r(0,1) - b(0,1) ] ;                   /* equation (4.8.a)   */
    b(0,0) = r(0,2) * Z s(0,1) ;                       /* equation (4.8.b)   */
    FOR j=2,k+1 DO
       r(0,j+1) = Z r(0,j) ;                           /* equation (4.9.a)   */
       b(0,j-1) = Z(2) [ b(0,j) + s(0,j) * r(0,j) ]    /* equation (4.9.b)   */
    END ;
    OUT( b(0,0) , FOR i=1,k p(i,2) , s(k,1) ) ;        /* output statement   */

Next, we will use the above program to simulate the computation of the
matrices L, U and the vector z for the matrix A given in (4.2). In order to
specify the input for this computation, we note that the INPUT statement in
the above program limits the length of the sequences to 18 elements. It also
determines the order in which the input sequences are read from the input
file, namely $\rho_{0,1}$, $\rho_{1,1}$, $\rho_{2,1}$ and then $\rho_{3,1}$.
Accordingly, we follow the pattern specified by (4.11.a/b), and use the data
from (4.1), to construct the following input file.

Input file:

0.0  0.0  0.0  4.0  d  d  14.0  d  d  15.0  d  d  0.0  d  d  -6.0

d  d  6.0  d  d  -3.0  d  d  -2.0  d  d

d  4.0  d  d  15.0  d  d  2.0  d  d  1.0

2.0  d  d  11.0  d  d  20.0  d  d  -19.0  d  d  14.0  ...

Finally, we use the SCE interpreter to run the above program with the given
input. This produces the following output file.

**** OUTPUT SEQUENCES ****

0.00  0.00  0.00  0.00  2.00  d  d  2.00  d  d  3.00  d  d  -3.00
d  d  3.00  d

***********************

0.00  d  d  3.00  d  d  -1.00  d  d  2.00  d  d  d  d
d  d  d  d

***********************

0.00  d  2.00  d  d  1.00  d  d  -5.00  d  d  -3.00  d  d
d  d  d  d

***********************

0.00  0.50  d  d  0.33  d  d  -1.00  d  d  0.33  d  d  -0.11
d  d  d  d

***********************

where, as specified by the OUT statement, the sequences are printed in the
order $\beta_{0,0}$, $\pi_{i,2}$, i=1,...,k, and then $\sigma_{k,1}$. It is
easy to verify that this output agrees with the results in (4.3) and (4.4),
and the formulas (4.12.a/b/c).
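The simulation step can also be reproduced outside the SCE interpreter. The
following Python sketch is not the interpreter described in this chapter; it
is a hypothetical stand-in that evaluates the causal equations (4.5)-(4.10)
for k=2 by memoized recursion over time. The function names, the use of
`None` for the don't-care element d, and the treatment of a division by zero
as don't care are assumptions made for the illustration. The two-unit zero
shift in (4.9.b) follows the `Z(2)` of the program listing.

```python
from functools import lru_cache

K, MAXT = 2, 18
D = None  # assumed stand-in for the don't-care element (printed as d)

def lift(f):
    """Make a binary arithmetic operation propagate the don't-care element."""
    return lambda x, y: D if x is D or y is D else f(x, y)

add = lift(lambda x, y: x + y)
sub = lift(lambda x, y: x - y)
mul = lift(lambda x, y: x * y)
div = lift(lambda x, y: x / y if y != 0 else D)  # assumption: x/0 is don't care

def zshift(f, t, n=1):
    """Zero shift: the first n elements are 0, then f delayed by n time units."""
    return 0.0 if t <= n else f(t - n)

def pad(seq):
    return seq + [D] * (MAXT - len(seq))

# The four input sequences r(i,1), i=0..3, copied from the input file.
R_IN = {
    0: pad([0.0, 0.0, 0.0, 4.0, D, D, 14.0, D, D, 15.0, D, D, 0.0, D, D, -6.0]),
    1: pad([D, D, 6.0, D, D, -3.0, D, D, -2.0]),
    2: pad([D, 4.0, D, D, 15.0, D, D, 2.0, D, D, 1.0]),
    3: pad([2.0, D, D, 11.0, D, D, 20.0, D, D, -19.0, D, D, 14.0]),
}

@lru_cache(maxsize=None)
def r(i, j, t):
    if j == 1:
        return R_IN[i][t - 1]
    if j == 2:  # (4.6.a) for i >= 1, (4.8.a) for i == 0
        fed = (lambda u: b(1, u)) if i == 0 else (lambda u: c(i, 1, u))
        return zshift(lambda u: sub(r(i, 1, u), fed(u)), t)
    return zshift(lambda u: r(i, j - 1, u), t)                # (4.7.c), (4.9.a)

@lru_cache(maxsize=None)
def s(i, j, t):
    if (i, j) == (K, 1):                                      # (4.5)
        return zshift(lambda u: div(1.0, sub(r(K + 1, 1, u), c(K + 1, 1, u))), t)
    if j >= 2 and i == K + 2 - j:
        return p(i, j, t)                                     # (4.10.c)
    return zshift(lambda u: s(i + 1, j, u), t)                # (4.6.c), (4.7.a)

@lru_cache(maxsize=None)
def p(i, j, t):
    if j == 2:
        return mul(zshift(lambda u: s(i, 1, u), t), r(i, 2, t))  # (4.6.b)
    return zshift(lambda u: p(i, j - 1, u), t)                # (4.7.b)

@lru_cache(maxsize=None)
def c(i, j, t):
    if i == 1:
        return 0.0                                            # (4.10.a)
    return zshift(lambda u: add(c(i - 1, j + 1, u),
                                mul(r(i - 1, j + 1, u), s(i - 1, j + 1, u))), t)  # (4.7.d)

@lru_cache(maxsize=None)
def b(j, t):
    if j == K + 1:
        return 0.0                                            # (4.10.b)
    if j == 0:
        return mul(r(0, 2, t), zshift(lambda u: s(0, 1, u), t))  # (4.8.b)
    return zshift(lambda u: add(b(j + 1, u),
                                mul(s(0, j + 1, u), r(0, j + 1, u))), t, n=2)  # (4.9.b)

times = range(1, MAXT + 1)
outputs = {
    "b(0,0)": [b(0, t) for t in times],
    "p(1,2)": [p(1, 2, t) for t in times],
    "p(2,2)": [p(2, 2, t) for t in times],
    "s(k,1)": [s(K, 1, t) for t in times],
}
```

Under these assumptions the leading elements of each sequence can be compared
directly with the output file above; for instance `outputs["b(0,0)"]` begins
with four zeros followed by 2.0, two don't-care elements and 2.0 again.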

Finally, we note that the potential application of the SCE language presented
in this chapter is not limited to the solver/simulator. For example, the SCE
language may be used for the precise specification of any systolic network
that can be described in terms of the abstract model. In fact, for a given
network, one may write an SCE program in which the causal equations and the
sequence declarations describe completely the graph of the network as well as
the operation of each of its cells. This SCE specification may be used, for
instance, as an input to an automatic lay-out program or to a translator that
generates specifications in some language used in the computer aided design
of VLSI devices.

The syntax directed approach used in the implementation of the
solver/simulator led to a very modular program. This simplifies the task of
modifying the solver to incorporate new SCE grammar rules that need to be
added when new types of sequence operators are introduced. In fact, adding a
new sequence operator to the grammar requires only the implementation of a
corresponding semantics routine that describes the effect of the operator.


5.  ON  THE  FLEXIBILITY  AND  POWER  OF  THE  ABSTRACT  MODEL 


After having presented the basic features and immediate applications of the
abstract model, we explore in this chapter some issues that demonstrate the
flexibility and the power of this model. In particular, we discuss two
restrictions that were imposed on the model, namely the absence of internal
states (memories) associated with the nodes of the graph, and the requirement
that the graph does not contain direct loops. In Sections 5.1 and 5.3, we show
that, despite these restrictions, it is nevertheless possible to apply the
model to computational cells with internal memory and to networks with direct
feed back connections.

The application of the model to the issue of the uniform treatment of data
and control signals is discussed in Section 5.2. In Section 5.4, we suggest a
technique that simplifies, to a great extent, the verification of pipelined
computations in systolic networks. Finally, we show in Section 5.5 that the
abstract model may be applied to self-timed systolic networks. For this, we
modify the interpretation given in Section 2.3 to allow for the model to be
applicable to any systolic network, irrespective of the method used for the
synchronization of its operation.

5.1.  Modeling  computational  cells  with  internal  memory. 

The abstract model, as defined in Section 2.2, does not explicitly allow the
nodes to have internal states or memory. In fact, the specification given in
Section 2.6 for the 1-D convolution network was not consistent with the
abstract model, since in equation (2.8), we assumed that each cell in the
network can store a specified real number $w_i$.

However, we note that the definition of causal operators allows any output to
depend on any previous input, thus eliminating the need for explicitly
associating memory with interior nodes. In order to illustrate this idea, we
consider the cell shown in Figure 5.1(a) and assume that the equation
describing the output on the link $s_{i+1}$ is

    $\sigma_{i+1} = \Omega [\sigma_i + [E_1^k \rho_i] * \pi_i]$    (5.1)

The factor $E_1^k \rho_i$ in (5.1) indicates that the output
$\sigma_{i+1}(t)$, at any time t, $sk+1 < t \le (s+1)k$, for some $s \ge 0$,
will depend on the data item that was present on the input link $r_i$ at the
time t=sk+1. Clearly, any physical realization of this cell needs to contain a
memory (see, e.g., Figure 5.1(b)).


Figure 5.1 - A multiply/add cell with internal memory

Figure 5.2 - A 1-D convolution network with writable in-cell memory
With this we can now give a consistent specification of the 1-D convolution
network. We add to the network of Figure 2.7 the new links $r_i$, i=0,...,k,
that will be used to input the values of the convolution constants
$w_1,\dots,w_k$ into the cells. The graph of the modified network is shown in
Figure 5.2 and its operation is specified by the causal equations:

    $\pi_{i-1} = \Omega \pi_i$    (5.2.a)
    $\rho_{i-1} = \Omega \rho_i$    (5.2.b)
    $\sigma_{i+1} = \Omega [\sigma_i + [E_{k+i-1}^{2n} \rho_i] * \pi_i]$,    i = 1,...,k    (5.2.c)

With this specification, the network I/O description may be obtained in the
form

    $\sigma_{k+1} = \Omega^k \sigma_1 + \sum_{i=1}^{k} \Omega^{2i-1} [[E_{2k-2i+1}^{2n} \rho_k] * \pi_k]$    (5.3)

The constants $w_1,\dots,w_k$ are supplied to the network through the input
link $r_k$. The idea is to let these constants flow on the r colored links and
let each cell i capture its corresponding constant as it arrives at its own
input link $r_i$. The cell then remembers its constant for the duration of the
computation, namely for 2n time units. Formally, the inputs to the network
are described by

    $\sigma_1 = \Omega^{k-1} \Theta \iota$
    $\pi_k = \Theta \xi$
    $\rho_k = \Theta \omega$

where $\iota$ is the zero sequence, $T(\xi) = n$, $T(\omega) = k$ and
$\xi(t) = x_t$, $\omega(t) = w_t$. Using this input in (5.3) and applying
Properties P10 and P4 in Appendix A we obtain

°*,i  -  n2*''  e  i  t  ne  E  n'’1  «  •  ii 

/=! 

This  leads  to  the  same  formula  as  in  Chapter  3.  namely  equation  (3.2). 

However, we note that the convolution constants, in the modified network, are
supplied as an input rather than being associated with the cells. This allows
for the possible pipelining of different computations, with different
convolution constants, on the same network. We will discuss pipelined
operations in some detail in Section 5.4.

Finally, we note that although the k cells in the above network are
identical, the operation of each cell i is determined by a local control
parameter that determines the instants at which the memory within the cell is
overwritten. In the next section, we discuss possible methods for controlling
the operation of computational cells that depend in their operation on local
parameters.

5.2.  Controlling  the  operation  of  systolic  cells. 

As mentioned in Section 2.4, the accumulator, multiplexer and expansion
operators A, M and E can be used to model systolic cells that contain
accumulators, multiplexers or periodic memories. Here the indices r, k, s and
w1,...,wn attached to these operators control different timings that may
affect the operation of the cell, as for instance, the reset times, the idle
times and the active times of the cell. One way of monitoring these different
timings in physical cells is by providing each cell with a separate circuit
that generates reset and idle signals at the specified times. This circuit
may be designed either for a specific value of the control parameter, or for
flexible assignments of the values of these parameters according to a desired
application.

On the other hand, timings may be monitored by signals external to the cell.
This external control method treats data and control signals in a uniform
manner [27], and is especially preferred in systolic networks if the timing
signals can be propagated systolically within the network.

As an example, we consider again the modified 1-D convolution network
discussed in the last section. Equation (5.2.c) specified that the memory
within any cell i, $1 \le i \le k$, is overwritten at times k+i-1+2sn,
s=0,1,... In order to apply the external control method, we add to each cell
i in the network an input link $c_i$ such that at any time t, the memory is
overwritten only if the data item on $c_i$ equals 1, that is
$\gamma_i(t) = 1$. If $\gamma_i(t) = 0$, then the content of the memory is not
changed.

In this example, the control parameter k+i-1+2sn, s=0,1,..., is linear in i.
Hence, we may control the operation of all the cells by means of a single
control signal that is propagated in the network. In Figure 5.3, we augment
the network of Figure 5.2 by the c colored links and add to its specification
the causal equations


Figure  5.3  -  A  1-D  convolution  network  controlled  by  external  signals. 

If the signal transmitted on the input link is specified by

_2n . 

7,  =  n  P  ( ear ) 

I  oo 


where $\alpha(1) = 1$ and $\alpha(t) = 0$ for $1 < t \le T(\alpha) = n$, then
it is easily shown that the control signal on the input link $c_i$ of any cell
i is described by

    $\gamma_i(t) = \begin{cases} 0 & \text{if } t < k+i-1 \\ 1 & \text{if } t = k+i-1+2sn,\ s = 0,1,\dots \\ 0 & \text{otherwise} \end{cases}$

This shows that the control signal will arrive at each cell at the
appropriate time.


The external control approach is equivalent to a redefinition of our
operators under which the control indices r, k and s are replaced by an
additional control argument. For example, the expression $E_r^k \xi$ used in
modeling a periodic memory cell may be replaced by $E(\xi,\gamma)$, where the
nonperiodic expansion operator E is defined by

    $[E(\xi,\gamma)](t) = \begin{cases} [E(\xi,\gamma)](t-1) & \text{if } \gamma(t) = 0 \\ \xi(t) & \text{if } \gamma(t) = 1 \end{cases}$

and the sequence $\gamma$ controls the resetting of the memory element; that
is

    $\gamma(t) = \begin{cases} 1 & t = r,\ r+k,\ r+2k,\ \dots \\ 0 & \text{otherwise} \end{cases}$

Evidently, the properties of the operator E may be derived directly from
those of $E_r^k$. Similar operators may be defined for nonperiodic
accumulators and multiplexers.
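As a small illustration of the control-argument form of the expansion
operator, the following Python sketch latches the data element whenever the
control element is 1 and holds it otherwise. The function names `expand` and
`periodic_control` are hypothetical, time is 1-based as in the text, and the
value held before the first latch (`None`, read as don't care) is an
assumption.

```python
def expand(xi, gamma, initial=None):
    """Nonperiodic expansion E(xi, gamma).

    At each time t the memory is overwritten with xi(t) when gamma(t) == 1;
    otherwise the previous output is repeated. `initial` is the assumed
    content of the memory before the first overwrite.
    """
    out, held = [], initial
    for x, g in zip(xi, gamma):
        if g == 1:
            held = x
        out.append(held)
    return out

def periodic_control(r, k, length):
    """Control sequence with gamma(t) = 1 for t = r, r+k, r+2k, ..., else 0."""
    return [1 if t >= r and (t - r) % k == 0 else 0
            for t in range(1, length + 1)]

xi = [10, 11, 12, 13, 14, 15, 16]
gamma = periodic_control(r=2, k=3, length=len(xi))
held = expand(xi, gamma)
```

Driving `expand` with the periodic control sequence reproduces the behavior of
a periodic memory that is refreshed at times r, r+k, r+2k, ...; any other 0/1
sequence gives a memory with nonperiodic refresh times.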


5.3.  Modeling  networks  with  direct  feed  back  loops. 

The condition described by equations (2.3) in Section 2.2 does not allow for
direct loops in the graph of systolic networks. In practice, however,
systolic networks that have two phase clocks (for stability considerations)
may contain computational cells with direct feed-back connections. In this
section, we show that we may still apply our model to describe the operation
of these cells. The iteration operator of Section 2.4 will be used for this
purpose.

Consider, for example, the cell shown in Figure 5.4. If we were allowed to use
direct feed back in the network's graphs, we could model the operation of
this cell by the equations


uo  =  /W'a,) 

* o  =  A2(*/'<V 


(5. 4. a) 
(5.4. b) 


with the condition that $\sigma_i = \sigma_o$, and that $A_1$ and $A_2$ are
given causal operators.

Note that the link $s_i$ cannot be an external input to the cell, and that
the link $s_o$ cannot be an external output of the cell. Hence, in order to
model  the  operation  of  the  cell,  it  suffices  to  relate  the  output  0Q  to  the 
input  0..  This  relation  is  obtained  by  substituting  aQ  =  in  (5. 4. a)  and 
using  Theorem  3.2  to  solve  the  resulting  equation.  This  gives 

0o  =  V*>  • 

This  completely  describes  the  I/O  behavior  of  the  cell. 


Figure 5.4 - A cell with direct feed-back

Figure 5.5 - The internal structure of a periodic accumulator.

As a specific example, we consider the cell whose internal structure is shown
in Figure 5.5 and whose operation is described by

    $\eta = M^{1,k,1}(\iota, \sigma)$    (5.5.a)
    $\zeta = \Omega_0 [\xi + \eta]$    (5.5.b)

where $\iota$ is the zero sequence. If the output link z is directly connected
to the input link s, then the output sequence $\zeta$ may be described as a
function of the input sequence $\xi$ only. We obtain this description by first
substituting $\sigma = \zeta$ in (5.5.a) and using the result in (5.5.b):

    $\zeta = \Omega_0 [\xi + M^{1,k,1}(\iota, \zeta)]$    (5.6)

We may then apply Theorem 3.2 to express the solution of (5.6) in the form

    $\zeta = I_{\eta}\, \Omega_0 [\xi + M^{1,k,1}(\iota, \eta)]$    (5.7)

This is an explicit I/O description of the cell. It is easy to show that the
formula (5.7) is equivalent with

    $\zeta = \Omega_0 A^{1,k,1}(\xi)$;    (5.8)

that is, the cell of Figure 5.5 acts as a periodic accumulator. A similar
result was proved by Johnsson and Cohen [27] using the delay operator defined
in [11]. In order to prove the equivalence of (5.7) and (5.8), we consider the
t-th element of $\zeta$. By (5.7) we have

    $\zeta(t) = [I_{\eta}\, \Omega_0 [\xi + M^{1,k,1}(\iota, \eta)]](t) = [\Omega_0 [\xi + M^{1,k,1}(\iota, \zeta)]](t)$

which by the definition of the multiplexer operator gives

    $\zeta(t) = \begin{cases} 0 & t = 1 \\ \xi(t-1) & t = 2,\ k+2,\ 2k+2,\ \dots \\ \xi(t-1) + \zeta(t-1) & \text{otherwise} \end{cases}$    (5.9)

Equation (5.9) is a recursive formula that may be rewritten in the form

    $\zeta(t) = \begin{cases} 0 & t = 1 \\ \sum_{j=0}^{n_a-1} \xi(t_r + j) & t > 1 \end{cases}$    (5.10)

where $t_r = (t-1) - \mathrm{mod}((t-2), k)$ and $n_a = t - t_r$. Finally, from
the definition of the accumulator operator (see Section 2.4), it follows that
the elements of the sequence $\zeta$ in (5.8) are also given by (5.10), which
proves that (5.7) and (5.8) are equivalent.
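The agreement between the recursion (5.9) and the closed form (5.10) can be
checked numerically. The Python sketch below is only an illustration: the
function names are hypothetical, sequences are stored in 0-based lists so that
`xi[t - 1]` holds the element at time t, and the closed form is taken as the
plain sum of the inputs since the most recent reset, with $t_r$ and $n_a$ as
defined above.

```python
def feedback_cell(xi, k):
    """Output of the feed-back cell by direct iteration of recursion (5.9)."""
    zeta = []
    for t in range(1, len(xi) + 1):
        if t == 1:
            value = 0                              # zero shift: first output is 0
        elif (t - 2) % k == 0:
            value = xi[t - 2]                      # reset times t = 2, k+2, 2k+2, ...
        else:
            value = xi[t - 2] + zeta[t - 2]        # xi(t-1) + zeta(t-1)
        zeta.append(value)
    return zeta

def closed_form(xi, k):
    """Same output computed from the closed form (5.10)."""
    zeta = []
    for t in range(1, len(xi) + 1):
        if t == 1:
            zeta.append(0)
            continue
        t_r = (t - 1) - ((t - 2) % k)              # time of the first accumulated input
        n_a = t - t_r                              # number of accumulated inputs
        zeta.append(sum(xi[t_r + j - 1] for j in range(n_a)))
    return zeta

xi = [1, 2, 3, 4, 5, 6, 7]
by_recursion = feedback_cell(xi, k=3)
by_formula = closed_form(xi, k=3)
```

For this input and k=3 both functions return the running sums restarted every
three time units, which is the behavior expected from a periodic accumulator.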
5.4.  Verification of Pipelined Operation.

In this section, we apply the abstract model to the verification of pipelined
operations in systolic networks. More specifically, given a systolic network
that has been shown to perform a certain computation successfully, we want to
study the issue of repeating the same computation on different data in a
pipelined fashion. Assume that a certain systolic network NET has the I/O
description

    $\eta_i = \Gamma_i(\xi_1, \dots, \xi_n)$,    i = 1,...,p    (5.11)

where $\xi_j$, j=1,...,n, and $\eta_i$, i=1,...,p, are the input and output
sequences of the network, respectively, and $\Gamma_i$, i=1,...,p, denote
certain causal operators that model the behavior of the network. Suppose also
that for a certain input description

    $\xi_j = \Omega^{r_j} \alpha_j$,    j = 1,...,n    (5.12)

with given integers $r_j$ and sequences $\alpha_j$, we were able to show that
the outputs are described by

    $\eta_i = \Omega^{s_i} \beta_i$,    i = 1,...,p    (5.13)

with specified integers $s_i$ and sequences $\beta_i$. In other words, suppose
that when (5.12) is used in the equations (5.11), then we were able to prove
that

    $\Omega^{s_i} \beta_i = \Gamma_i(\Omega^{r_1} \alpha_1, \dots, \Omega^{r_n} \alpha_n)$,    i = 1,...,p    (5.14)

The calculation of the elements of $\beta_i$, i=1,...,p, from those of
$\alpha_j$, j=1,...,n, using the network NET shall be called the computation
"C". The time of this computation is defined as the time required by NET to
complete C from the moment when the first non-$\delta$ input entered NET to
the moment when the last non-$\delta$ output was produced. More precisely,

    $\mathrm{Time}(C) = \max\{T(\Omega^{s_i} \beta_i);\ 1 \le i \le p\} - \min\{r_j;\ 1 \le j \le n\}$    (5.15)

where, as usual, T is the termination function defined in Section 2.1.

Often, it is desirable to repeat the computation C, say m times, with
different data sets $A^e = (\alpha_j^e,\ j=1,\dots,n)$, e=1,...,m. Let us
denote these m instances of C by $C^e$, e=1,...,m. In many networks, this may
be accomplished by pipelining $C^1, \dots, C^m$. The time difference between
the initiations of two successive instances $C^e$ and $C^{e+1}$ will be
defined as the pipe separation $\tau$ of the computation C. In this case, the
inputs for the different instances of C should be pipelined on the input
links. That is, equation (5.12) for the input sequences should be replaced by

    $\xi_j^* = \Omega^{r_j} P_{e=1,m}^{\tau}(\alpha_j^e)$,    j = 1,...,n    (5.16)

where we used the asterisk in $\xi_j^*$ to indicate that the sequences
represent the input data during the pipeline-operation. We will also use
$\xi_j^e$ to represent the inputs (5.12) for a specific instance $C^e$ of the
computation. This * and e superscript notation will be employed from now on
for sequences on any communication link.

If the computation can be successfully pipelined on NET with a separation
$\tau$, then by using the inputs (5.16) in the network I/O description (5.11),
we should be able to prove that the output sequences during the
pipeline-operation are described by

    $\eta_i^* = \Omega^{s_i} P_{e=1,m}^{\tau}(\beta_i^e)$,    i = 1,...,p    (5.17)

In order to ensure a successful pipeline-operation, the pipe separation
$\tau$ must be large enough so that the inputs of the different instances
$C^e$ do not overlap and the corresponding outputs do not overwrite each
other. The first condition implies that $\tau \ge T(\alpha_j^e)$, j=1,...,n,
and the second that $\tau \ge T(\beta_i^e)$, i=1,...,p. In other words, the
minimum pipe separation $\tau_m(C)$ for the computation C is equal to the
maximum span of all the input and output sequences in C, where the span of a
sequence is defined as the time difference between the first and the last
significant elements (non-$\delta$ or non zero elements) in the sequence plus
1, that is the time during which the sequence carries information relevant to
the computation. Hence, a network that can be used to pipeline a computation
C with a pipe separation $\tau_m(C)$ achieves maximum efficiency, from the
viewpoint of the pipelined operation.
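The notion of span, and the lower bound it gives on the pipe separation, can
be sketched in a few lines of Python. This is an illustration only: the
function names are hypothetical, `None` is assumed to represent the
don't-care element, and which elements count as insignificant is left as a
parameter because the text allows either non-don't-care or non zero elements.

```python
def span(seq, insignificant=(None,)):
    """Time between the first and last significant elements of seq, plus 1.

    Returns 0 when the sequence carries no significant element at all.
    """
    times = [t for t, x in enumerate(seq, start=1) if x not in insignificant]
    return times[-1] - times[0] + 1 if times else 0

def min_pipe_separation(input_seqs, output_seqs, insignificant=(None,)):
    """Lower bound on the pipe separation: the maximum span over all
    input and output sequences of one instance of the computation."""
    return max(span(s, insignificant) for s in list(input_seqs) + list(output_seqs))

inputs = [[1.0, None, 2.0, None, 3.0]]          # span 5
outputs = [[None, None, 5.0, None, 6.0]]        # span 3
bound = min_pipe_separation(inputs, outputs)
```

The bound is only a necessary condition: a given network may still require a
larger separation before the pipelined outputs can be derived.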

In order to derive (5.17) from (5.16) and (5.11) without repeating the effort
spent in proving (5.14), we use the negative shift operator and the equation
(5.12) to rewrite the pipelined input (5.16) as

    $\xi_j^* = \Omega^{r_j} P_{e=1,m}^{\tau}(\Omega^{-r_j} \xi_j^e)$,    j = 1,...,n    (5.18)

Here $\xi_j^e$ are the inputs that would be used if the instance $C^e$ of C
had been performed on NET without any pipelining. Next, we substitute (5.18)
into the network I/O description (5.11) and obtain for i=1,...,p

    $\eta_i^* = \Gamma_i([\Omega^{r_1} P_{e=1,m}^{\tau}(\Omega^{-r_1} \xi_1^e)], \dots, [\Omega^{r_n} P_{e=1,m}^{\tau}(\Omega^{-r_n} \xi_n^e)])$    (5.19)

The remainder of the proof is based on the use of the different properties in
Appendix A for factoring the shift and the piping operators out of the causal
operator $\Gamma_i$. If the computation can be successfully pipelined through
NET, then we should be able to transform (5.19) into the form

    $\eta_i^* = \Omega^{s_i} P_{e=1,m}^{\tau}(\Omega^{-s_i} \Gamma_i(\xi_1^e, \dots, \xi_n^e))$,    i = 1,...,p    (5.20)

which by (5.11) and (5.13) directly reduces to (5.17).

It should be noted, however, that there exist computations for which there is
no $\tau$-value for which (5.20) is derivable from (5.19). This means that
such computations cannot be pipelined. On the other hand, we can identify a
class of computations for which pipelining is always possible. The term
'inert' shall be used to identify computations in this class. More
specifically, a computation C on a systolic network NET is called inert if

1) At its initiation, C does not care about the data on the non input
communication links of NET, that is we may assume that at time t=1, the data
in any non input sequence are $\delta$'s. This implies that any delay in NET
should be modeled using the shift operator and not the zero shift operator.

2) Only $\delta$-regular operators are used for modeling the cells in NET.
This implies that the network does not treat $\delta$ as a special symbol.

It is always possible to pipeline an inert computation C through the
corresponding network NET. In fact we may simply choose the pipe separation
$\tau$ to be the time of the computation defined by (5.15). With this value of
$\tau$, $C^{e+1}$ does not start before $C^e$ is terminated. Of course, we are
not interested in such large values of $\tau$, and hence, the problem arises
of finding the least value of $\tau$ for which (5.20) is derivable from
(5.19).

As should be clear from the above discussion, the ability to derive (5.20)
from (5.19) is the major issue in verifying the pipeline operation of any
systolic network, and this ability depends principally on the value of
$\tau$. However, for any inert computation C, we know that there exists a
value for which (5.20) is derivable from (5.19). In order to find the least
possible $\tau$, we start with $\tau = \tau_m(C)$ and proceed to factor out
the shift and piping operators from (5.19) until we either reach (5.20), which
is our goal, or we cannot continue the factorization because of a small value
of $\tau$. In the latter case, we increase $\tau$ appropriately and repeat the
derivation procedure.
EXAMPLE:

As an example, we consider once more the modified 1-D convolution network of
Section 5.1. Recall that the network I/O description was given by

    $\sigma_{k+1} = \Omega^k \sigma_1 + \sum_{i=1}^{k} \Omega^{2i-1} [[E_{2k-2i+1}^{2n} \rho_k] * \pi_k]$    (5.21)

and that, when the inputs for a certain instance $C^e$ of the computation are
specified by

    $\sigma_1^e = \Omega^{k-1} \Theta \iota$    (5.22.a)
    $\pi_k^e = \Theta \xi^e$    (5.22.b)
    $\rho_k^e = \Theta \omega^e$    (5.22.c)

then the output is described by

    $\sigma_{k+1}^e = \Omega^{2k-1} \Theta \eta^e$    (5.23)

The detailed forms of the sequences $\iota$, $\xi^e$ and $\omega^e$,
containing the input data for $C^e$, and the sequence $\eta^e$ containing the
result of $C^e$, were specified earlier.

It is easy to see that this convolution computation is inert. Hence, it is
always possible to pipeline different instances of the computation on the
same network. In this case, the minimum pipe separation during the pipelined
operation is $\tau_m = \max\{2n,\ 2k,\ 2n-2(k-1)\} = 2n$. If m instances are
pipelined through the network with the minimum separation, then the inputs
should have the forms

    $\sigma_1^* = \Omega^{k-1} P_{e=1,m}^{2n}(\Theta \iota) = \Omega^{k-1} P_{e=1,m}^{2n}(\Omega^{-(k-1)} \sigma_1^e)$    (5.24.a)
    $\pi_k^* = P_{e=1,m}^{2n}(\Theta \xi^e) = P_{e=1,m}^{2n}(\pi_k^e)$    (5.24.b)
    $\rho_k^* = P_{e=1,m}^{2n}(\Theta \omega^e) = P_{e=1,m}^{2n}(\rho_k^e)$    (5.24.c)


Using the pipelined input (5.24) in the network I/O description (5.21), we
obtain the output of the pipelined operation, namely


&  V 


(n“(/f-1)o®)  +c  n2/_1 


*  2k  -1  -2n 

°k  + 1  =  n  e=l.m'“  W1‘ 

P^n  (O®)]  *  P20  (77®)  J 

"2k -2/ +  1  e=l.mipkJ  e  =  l.mlwkJ 


(f£2n 


Next, we use the Properties P1.6 and P1.1 in Appendix A to rewrite this as


am  =  n2*"1  P2n 
°*+l  n  ^8=1. m 


vr*-"#  t  i  n2/-1 


p2n  [Z&n  9 

e  =  l. m  C£2A-2/  +  l  (p/t) 


(5.25) 


Using  the  fact  that  raEg”_2^+1  <  2n.  and  applying  Property 

P8.3.  we  may  reduce  (5.25)  to 


*  _  2k- 1  p2n 

°*t1  "  n  Pe=l.m 


tnV  ♦  E  n2'” 


'1 


Finally,  from  (5.21)  and  (5.23)  we  obtain 

.) 


*  _2k-l  _2n 

°*t1  ■  n  ..  /»=l.m<n 


=  n2*-’  p2*  (©  „•> 

8=i.m 


This  proves  that  the  output  of  the  m  instances  will  appear  on  the  output 
link  s^t1  at  a  rate  equal  to  the  Input  data  rate,  namely  the  output  of  one 
instance  every  2n  time  units. 

Note  that  the  technique  suggested  in  this  section  separates  the  verifica¬ 
tion  of  the  pipelined  operation  from  the  verification  of  the  correct  execution 
of  one  instance  of  the  computation.  This  separation  leads  to  a  clearer 
logic  and  simpler  proofs. 
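
The timing claim of the example can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the shift operator is taken to delay a sequence by a given number of time units (filling with don't care items), and the piping operator is taken to place the first r items of each instance in consecutive windows of length r. The names `shift`, `pipe`, `pipelined_output` and `DONT_CARE` are hypothetical and are not part of the formal model.

```python
DONT_CARE = None  # stand-in for a don't care item (assumption)


def shift(seq, k):
    """Delay a finite sequence by k time units (assumed meaning of the shift)."""
    return [DONT_CARE] * k + list(seq)


def pipe(instances, r):
    """Place the first r items of each instance in consecutive windows of
    length r (assumed meaning of the piping operator)."""
    out = []
    for seq in instances:
        window = list(seq[:r])
        window += [DONT_CARE] * (r - len(window))  # pad short instances
        out.extend(window)
    return out


def pipelined_output(results, n, k):
    """Contents of the output link when the instances are separated by
    2n time units and the first result appears after 2k-1 time units."""
    return shift(pipe(results, 2 * n), 2 * k - 1)


# Three instances, each producing two significant results (n = 3, k = 2).
out = pipelined_output([[1, 2], [3, 4], [5, 6]], n=3, k=2)
```

In this sketch the first results of consecutive instances are exactly 2n positions apart, which is the output rate stated above.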


5.5.  Self  timed  systolic  networks. 

So far, we have applied the abstract model to clocked systolic networks,
that is, systolic networks that are synchronized by a global clock. In
this section, we show that, with a slight modification, the model may be
applied to self timed systems as well [50]. In order to explain the
difference between the two types of systems, we first generalize the
definition of systolic networks to include any network in which computational
cells have a basic cycle that is repeated indefinitely, unless the cell is
forced to halt externally. After the initiation of a cycle in a computational
cell, the cell reads its input from the input links, performs a specific
computation and then produces its results on the output links. Here, we will
assume that a cycle terminates with the initiation of the next cycle.

In clocked systolic networks, the cycles of all the cells are initiated
simultaneously by an external global signal (a clock) that is broadcast to
every cell. The duration of the successive cycles is the same and is often
called "one time unit". We repeatedly used this terminology in our
discussion.

On the other hand, the initiation of the cycles in self timed systolic
networks is not synchronized, and each cell determines locally the instant at
which its cycles should start. Many protocols may be applied to organize
self timed systems. Here, we consider an organization that is directly
suggested by our abstract model.

Assume that the initiation of the first cycle is synchronized by a reset
signal in all the cells in the self timed network, and that a new cycle in
any particular cell is initiated at the instant when the cell produces the last
output of its current cycle. Assume also that each communication link in
the network is augmented by a pair of Request/Acknowledgment lines,
denoted here by REQ and ACK, respectively. These lines are used to
implement a 2-cycle, non-return to zero handshake protocol [50] between
the sender S and the receiver R of the data on any communication link.
REQ and ACK may carry a single bit (0 or 1) from S to R and from R to
S, respectively.

The protocol is briefly explained as follows: S does not send data on
the communication link unless REQ and ACK are in the same state (both 0
or both 1). After sending the data, S changes the state of REQ, signaling
that the link contains valid data. When R senses that REQ and ACK are in
different states, it reads the data and changes the state of ACK, indicating
that it received the data and that it is ready to receive new data. More
descriptively, after the first cycle is initiated, each cell executes the
following algorithm indefinitely (unless externally forced to halt):

ALG4

1) Wait until the REQ and ACK lines associated with each input link
are in opposite states.

2) Read the input data and change the state of the corresponding
ACK lines.

3) Perform the computation on the input data.

4) Wait until the REQ and ACK lines associated with each output link
are in the same state.

5) Put the results on the output links and change the state of the
corresponding REQ lines.

6) Initiate a new cycle by going to step 1.

Note that any REQ/ACK pair of lines associated with a network input link is
initially set such that REQ and ACK are in the same state. This indicates
that the link is ready to accept external input. On the other hand, the
REQ and ACK lines associated with any other link in the network are
initially set to opposite states to indicate that a certain data item is
initially present on the communication link (this item may be a don't care).
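
The protocol and the initial line settings can be sketched in Python as follows. This is a minimal sequential simulation, not a model of real asynchronous hardware: a simple round-robin loop stands in for the independent timing of the cells, and the classes `Link` and `Cell`, the helper `lift` and the two-cell chain in `run_chain` are hypothetical names introduced only for this illustration.

```python
DONT_CARE = None  # stand-in for a don't care item


class Link:
    """Communication link with REQ/ACK lines (2-cycle, non-return to zero)."""

    def __init__(self, holds_initial_item):
        self.data = DONT_CARE
        self.ack = 0
        # Opposite states mean that a valid item is present on the link.
        self.req = 1 if holds_initial_item else 0

    def has_item(self):
        return self.req != self.ack

    def send(self, value):      # sender: place the data, then change REQ
        assert not self.has_item()
        self.data = value
        self.req ^= 1

    def receive(self):          # receiver: read the data, then change ACK
        assert self.has_item()
        self.ack ^= 1
        return self.data


class Cell:
    def __init__(self, inputs, outputs, func):
        self.inputs, self.outputs, self.func = inputs, outputs, func
        self.pending = None     # results computed but not yet sent

    def step(self):
        """Advance through ALG4 as far as the line states allow."""
        if self.pending is None and all(l.has_item() for l in self.inputs):
            args = [l.receive() for l in self.inputs]        # steps 1-2
            self.pending = self.func(*args)                  # step 3
        if self.pending is not None and not any(l.has_item() for l in self.outputs):
            for link, value in zip(self.outputs, self.pending):  # steps 4-5
                link.send(value)
            self.pending = None                              # step 6


def lift(f):
    """Apply f to significant data; pass don't care items through."""
    return lambda x: (DONT_CARE if x is DONT_CARE else f(x),)


def run_chain(inputs, n_outputs, max_steps=1000):
    # Input link starts empty; interior and output links hold an initial item.
    a, mid, out = Link(False), Link(True), Link(True)
    cells = [Cell([a], [mid], lift(lambda x: x + 1)),
             Cell([mid], [out], lift(lambda x: 2 * x))]
    queue, collected = list(inputs), []
    for _ in range(max_steps):
        if len(collected) == n_outputs:
            break
        if queue and not a.has_item():
            a.send(queue.pop(0))             # external source obeys the protocol
        for cell in cells:
            cell.step()
        if out.has_item():
            collected.append(out.receive())  # external sink collects every item
    return collected


result = run_chain([1, 2, 3], 5)
```

In this sketch the first two collected items are don't cares (the item initially present on the output link, followed by the result computed from the item initially present on the interior link), and the significant results follow in input order.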

With this general definition of a cycle in a systolic network, we
generalize the physical interpretation of part [A3] of the abstract model (see
Section 2.2) to allow for its application to systolic networks irrespective of
the method used for the synchronization of their operation. More
specifically, if $x_u$ is an OUT edge of a node $u$ in the graph of the
network, then the sequence $\xi_u$ associated with $x_u$ is interpreted as
follows.


1) If node $u$ is a source node, that is, if $x_u$ is a network input link,
then $\xi_u(t)$ is the $t$th data item that is externally transmitted on $x_u$
from the input source.

2) If node $u$ is an interior node, then $\xi_u(1)$ is the data item that
appears on $x_u$ at the beginning of the operation of the network, and for
any $t>1$, $\xi_u(t)$ is the data item that is placed on $x_u$ by the
computational cell $u$ as the result of its $(t-1)$st cycle.

With this general interpretation of data sequences, a systolic network
has to satisfy some constraints in order to ensure that the causal equations

    $$ \xi_u^i = \Gamma_u^i(\eta_u^1, \ldots, \eta_u^m), \quad i = 1, \ldots, n $$    (5.26)

associated with an interior node $u$ indeed model the computation of the
corresponding cell. Here, $y_u^1, \ldots, y_u^m$ and $x_u^1, \ldots, x_u^n$
are the IN and OUT edges of $u$, respectively. These constraints are:

C1 : The computational cell corresponding to any interior node $u$ will
not be blocked and will continue its execution for infinitely many
cycles (unless externally forced to halt).

C2 : For any $i$, $1 \le i \le m$, and $t$, $t \ge 1$, the communication
pattern in the network will ensure that $\eta_u^i(t)$ is the data read in by
the cell $u$ from the link $y_u^i$ during its $t$th cycle.

If constraints C1 and C2 are satisfied in a systolic network, then the model
of Section 2.2 may be applied to the specification and verification of the
network.

For clocked systolic networks, C1 is automatically satisfied as each
cycle is initiated by a global clock that is supposed to run continuously. In
order to satisfy the second constraint C2, we may assume the following: 1)
no cell places the result of the $t$th cycle of its computation on the output
links before the end of the cycle, 2) the duration of each cycle is taken to
be at least equal to the time required by the slowest cell in the network to
complete its computation, and 3) the actual reading of the data by any cell
does not start less than a certain time $\Delta$ after the initiation of a
cycle, where $\Delta$ is the time-span necessary for the signals on the
communication links to stabilize. These assumptions are usually implicitly
made when clocked systolic networks are discussed.

For self timed systolic networks, a handshake protocol should be used
in order to ensure that the constraints C1 and C2 are satisfied. This
protocol should also be obeyed in all external interactions (inputs and
outputs) with the network. For example, if the protocol defined by ALG4 is
used to synchronize the operation of the cells, then all inputs and outputs
to the network must satisfy the following rules:

1) A data item is not transmitted on a network input link unless the
associated REQ and ACK lines are in the same state. The state of REQ should
be changed after the data is placed on the link.

2) For any input link $x_{in}$, all the elements in the infinite sequence
$\xi_{in}$, including don't cares, are transmitted on $x_{in}$. However, a
don't care item may be transmitted by simply changing the state of REQ
without placing any significant data on the data lines of $x_{in}$.

3) For any network output link $x_o$, every output item must be collected,
even if there is no interest in its value. In this case, the collection of the
data is simply achieved by changing the state of the ACK line associated
with $x_o$.

It is clear that the protocol described in this section ensures that
during its $t$th cycle, any cell $u$ will read the $t$th elements appearing
on its input links $y_u^1, \ldots, y_u^m$, provided that these elements are,
at some point of time, transmitted on the corresponding links. Note that the
only reasons that may prevent the $t$th element $\eta_u^i(t)$ from ever being
transmitted on $y_u^i$ are 1) $y_u^i$ is a network input link that is not
supplied with data items, or 2) $y_u^i$ is an OUT edge of an interior node
$v$, and the execution of the cell $v$ was blocked before the completion of
its $(t-1)$st cycle. If we always supply the inputs when required, it is clear
that the network will satisfy the constraint C2 if only the first constraint,
C1, is satisfied.

In the literature of distributed processing, the constraint C1 is usually
called the "liveness property" and may be formally verified with the help of
the so called temporal logic [37]. This formal verification is beyond the
scope of this dissertation. However, we illustrate here the steps of a
formal proof by the following informal argument: Assume that any cell in the
network does eventually complete its $(t-1)$st cycle for any $t>1$. This
implies that for any communication link $x_u$ in the network, $\xi_u(t)$ will
eventually be transmitted on $x_u$. In other words, the inputs of the $t$th
cycle of any cell in the network will eventually appear on the input links of
that cell, and hence, each cell will eventually read its appropriate inputs
and thereby free its input links. This, together with the fact that the
outputs of the network will eventually be read, leads to the conclusion that
any link in the network will eventually be ready to receive its $(t+1)$st
element. Hence, each cell in the network will be able to output the results
of its $t$th cycle, which means that every cell in the network will
eventually complete its $t$th cycle. Adding to this a demonstration that
every cell will eventually complete its first cycle, we may show that each
cell in the network will execute infinitely many cycles, thus satisfying the
constraint C1.

Finally, we note that, in the organization for self timed systems
discussed here, the role of the don't care elements is crucial. More
specifically, a don't care element is interpreted as a data item rather than
a "nothing". In other organizations of self timed systems, the operation of
each cell does not start before the cell receives significant input data
(items other than don't cares). Systems of this type may result in a
deadlock situation, thus violating the constraint C1. Clearly, the
organization and verification of self timed systems is still a wide open
area for research.
6. INTRODUCTION TO FINITE ELEMENT ANALYSIS.

Finite element analysis is a technique widely used by engineers [57,42]
and applied mathematicians [46,51] for solving boundary value problems for
partial differential equations. In the linear case, a majority of these
boundary value problems can be formulated as a variational problem of the
following form:


Given two Hilbert spaces $\mathcal{H}_1$ and $\mathcal{H}_2$, an appropriate
bilinear operator $\Phi$ and a corresponding linear functional $\Psi$ on
$\mathcal{H}_2$, find the function $\varphi \in \mathcal{H}_1$ such that

    $$ \Phi(v,\varphi) = \Psi(v) \quad \text{for all } v \in \mathcal{H}_2 $$    (6.1)

For example, consider the 2-dimensional heat conduction problem in the
closed domain $Q$ shown in Figure 6.1(a), where the curved part of the
boundary $\partial Q$ (denoted by $\partial Q_1$) is thermally insulated and
the temperature distribution on the straight part of the boundary
$\partial Q_2 = \partial Q - \partial Q_1$ is forced to be equal to a given
function $g(x,y)$. If $\varphi(x,y)$ and $f(x,y)$ denote the temperature and
the rate of heat generation at any point $(x,y) \in Q$, respectively, then
the equations governing the heat flow are

    $$ \frac{\partial}{\partial x}\left( w_x \frac{\partial \varphi}{\partial x} \right) + \frac{\partial}{\partial y}\left( w_y \frac{\partial \varphi}{\partial y} \right) = -f(x,y) \quad \text{on } Q $$    (6.2.a)

    $$ \frac{\partial \varphi}{\partial n} = 0 \quad \text{on } \partial Q_1 $$    (6.2.b)

    $$ \varphi = g \quad \text{on } \partial Q_2 $$    (6.2.c)

Here $w_x$ and $w_y$ are functions that depend on the material properties
(e.g. specific heat) and $\partial \varphi / \partial n$ is the derivative of
$\varphi$ with respect to the outward unit normal to the boundary. The
variational problem corresponding to equation (6.2) is to find a function
$\varphi(x,y) \in W^{1,2}(Q)$ such that

    $$ \int_Q \left( w_x \frac{\partial v}{\partial x} \frac{\partial \varphi}{\partial x} + w_y \frac{\partial v}{\partial y} \frac{\partial \varphi}{\partial y} \right) dx\,dy + \int_{\partial Q_2} v \left( \varphi - \frac{\partial \varphi}{\partial n} \right) ds = \int_Q f\,v\,dx\,dy + \int_{\partial Q_2} g\,v\,ds $$

for all $v \in W^{1,2}(Q)$, where $\int_{\partial Q_2}$ is a line integral
evaluated on the boundary $\partial Q_2$. The definition of the Sobolev space
$W^{1,2}$ may be found in books on functional analysis such as [1].

Figure 6.1 - (a) The domain of the problem. (b) The finite element mesh.
(c) Local node numbering for element 5.

In this dissertation, we will restrict ourselves to variational problems in
which $Q$ is a bounded domain in a two dimensional space with coordinates
$(x,y)$, and the Hilbert spaces $\mathcal{H}_1$ and $\mathcal{H}_2$ are
identical and from now on called $\mathcal{H}$. Moreover, we will assume that
the operator $\Phi$ and the functional $\Psi$ have the general forms

    $$ \Phi(v,\varphi) = \iint_Q \sum_{r,l=0}^{2} a_{r,l}(x,y)\, D_r v(x,y)\, D_l \varphi(x,y)\, dx\,dy + \mathcal{P}^0 $$    (6.3.a)

    $$ \Psi(v) = \iint_Q f(x,y)\, v(x,y)\, dx\,dy + \mathcal{P}^1 $$    (6.3.b)

where $f$ and $a_{r,l}$, $r,l = 0,1,2$, are given functions, $\mathcal{P}^0$
and $\mathcal{P}^1$ are line integrals over the boundary $\partial Q$ of $Q$,
$D_1$ and $D_2$ are the differential operators $\partial/\partial x$ and
$\partial/\partial y$, respectively, and $D_0$ is the identity operator, that
is $D_0 \varphi = \varphi$. The form of the integrals $\mathcal{P}^0$ and
$\mathcal{P}^1$ will not be specified in detail as this is not crucial for the
purpose of our discussion.

The finite element process for the variational problem (6.1)/(6.3) begins
with the specification of a mesh that divides $Q$ into $m$ finite elements
$Q_e$, $e = 1, \ldots, m$ (e.g. see Figure 6.1(b)). In addition to its
geometric shape, each element is identified by a number of nodes. With each
node, we associate a basis function which is a piece-wise continuous function
that equals one at that node and zero at any other node.


Figure 6.2 - Some element types: (a) a three nodes triangular element,
(b) a four nodes quadrilateral element, (c) a nine nodes Lagrangian element.


In the following chapters, we will make the reasonable, albeit somewhat
restricting, assumption that all the elements in the mesh covering $Q$
are of the same type, and that each has $k$ nodes (Figure 6.2 shows some
element types frequently used in practice). Each node in a specific
element $e$, $1 \le e \le m$, may be locally identified by a pair of indices
$(e,i)$, for some $i$, $1 \le i \le k$. Alternatively, a global scheme may be
used to identify each node by a unique integer $j$, $1 \le j \le n$, where $n$
is the total number of distinct nodes in the mesh. The relation between the
local label $(e,i)$ of a node and its global label $j$ is defined by some
mapping $glob : [1,m] \times [1,k] \to [1,n]$, where $j = glob(e,i)$.
Accordingly, we may define for each element $e$ the boolean matrix $M^e$ of
order $k \times n$ such that for $1 \le i \le k$, $1 \le j \le n$, we have

    $$ M^e(i,j) = \begin{cases} 1 & \text{if } glob(e,i) = j \\ 0 & \text{otherwise.} \end{cases} $$    (6.4)

The matrices $M^e$, $e = 1, \ldots, m$, and their transposes will play an
important role in the finite element analysis.

As an example, consider the mesh in Figure 6.1(b), where each finite
element is labeled by a number $e$, $e = 1, \ldots, 12$ (written inside a
circle), and each node has been given a global label $j$, $1 \le j \le 20$
(written next to each node). Figure 6.1(c) isolates the finite element $e=5$
and gives each of its nodes a local number such that

glob (5,1) = 6
glob (5,2) = 7
glob (5,3) = 11
glob (5,4) = 10

Thus

         00000100000000000000
M^5  =   00000010000000000000
         00000000001000000000
         00000000010000000000
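
The construction of $M^e$ from the mapping glob can be sketched in a few lines of Python. The function name `boolean_matrix` and the list representation are illustrative assumptions; only the labels of element 5 are taken from the example above.

```python
def boolean_matrix(glob_e, n):
    """Return the k x n boolean matrix M^e of (6.4).

    glob_e[i-1] is the global label glob(e, i) of local node i (1-based labels).
    """
    M = [[0] * n for _ in glob_e]
    for i, j in enumerate(glob_e):
        M[i][j - 1] = 1   # row of local node i+1, column of global node j
    return M


# Element 5 of the example: glob(5,1..4) = 6, 7, 11, 10 in a mesh of 20 nodes.
M5 = boolean_matrix([6, 7, 11, 10], 20)
rows = ["".join(str(x) for x in row) for row in M5]
```

Each row of the result contains a single 1, in the column of the corresponding global node.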

Given a finite element mesh, we reformulate the integrals in (6.3) as
the sum of integrals over the finite elements

    $$ \Phi(v,\varphi) = \sum_{e=1}^{m} \left\{ \iint_{Q_e} \sum_{r,l=0}^{2} a_{r,l}\, D_r v\, D_l \varphi\, dx\,dy + \mathcal{P}^{0,e} \right\} $$    (6.5.a)

    $$ \Psi(v) = \sum_{e=1}^{m} \left\{ \iint_{Q_e} f\, v\, dx\,dy + \mathcal{P}^{1,e} \right\} $$    (6.5.b)

Here $\mathcal{P}^{0,e}$ and $\mathcal{P}^{1,e}$ are the parts of
$\mathcal{P}^0$ and $\mathcal{P}^1$ evaluated on the boundary of element $e$
if this boundary intersects with $\partial Q$, and otherwise,
$\mathcal{P}^{0,e} = \mathcal{P}^{1,e} = 0$.

Now, the space $\mathcal{H}$ of the functions $\varphi$ and $v$ in (6.5) is
replaced by a space of piece-wise spline functions over $Q$; that is,
$\varphi$ and $v$ are approximated on each finite element $e$ by

    $$ \varphi(x,y) = \sum_{i=1}^{k} \varphi_{e,i}\, \psi_{e,i}(x,y) $$    (6.6.a)

    $$ v(x,y) = \sum_{i=1}^{k} v_{e,i}\, \psi_{e,i}(x,y) $$    (6.6.b)

where $\varphi_{e,i}$ and $v_{e,i}$ are the values of $\varphi$ and $v$ at the
node $(e,i)$, respectively, and each $\psi_{e,i}$ is the basis function
associated with node $(e,i)$. With the approximation (6.6) in (6.5), it turns
out that the values $\varphi_{e,i}$ of the approximate solution
$\varphi(x,y)$ at the nodes of the mesh satisfy a linear system of equations
of the form

    $$ H u = b $$    (6.7)

where

1) $u$ is an $n$-dimensional vector such that its $i$th component $u_i$ is
the value of $\varphi$ at the global node $i$, that is, if $i = glob(e,j)$,
then $u_i = \varphi_{e,j}$.

2) $H$ is an $n \times n$ banded, symmetric, positive definite matrix called
the global stiffness matrix. With the matrices $M^e$ of (6.4), $H$ may be
expressed as the sum

    $$ H = \sum_{e=1}^{m} M^{eT} H^e M^e $$    (6.8)

of elemental matrices $H^e = \bar{H}^e + S^e$, $e = 1, \ldots, m$. For a
specific element $e$, the $(i,j)$th entry of the matrix $\bar{H}^e$ is
computed by the formula

    $$ \bar{H}^e_{i,j} = \sum_{r,l=0}^{2} \iint_{Q_e} a_{r,l}\, D_r \psi_{e,i}\, D_l \psi_{e,j}\, dx\,dy $$    (6.9)

Then, $H^e$ is obtained by adding to each entry $\bar{H}^e_{i,j}$ the term
$S^e_{i,j}$ resulting from the discretization of $\mathcal{P}^{0,e}$ in
(6.5.a). This term, $S^e_{i,j}$, is a line integral that is not equal to zero
exactly when both nodes $(e,i)$ and $(e,j)$ lie on the boundary $\partial Q$.
Hence, most of the elements of the matrices $S^e = (S^e_{i,j})$,
$e = 1, \ldots, m$, are zeroes. Moreover, if the boundary conditions
associated with the problem are of the natural type [57], then the term
$\mathcal{P}^0$ disappears from (6.3.a) and all the matrices $S^e$ become zero
matrices. In fact, for any finite element problem, the work associated with
the computation of $S^e$ is negligible compared with that required for the
computation of $\bar{H}^e$.

3) $b$ is an $n$-dimensional vector called the global load vector which
may be expressed as the sum

    $$ b = \sum_{e=1}^{m} M^{eT} \left[ \bar{b}^e + s^e \right] = \sum_{e=1}^{m} M^{eT} b^e $$    (6.10)

where the components of the vector $s^e$ are line integrals over $\partial Q$
and the $i$th component of the elemental vector $\bar{b}^e$ is given by

    $$ \bar{b}^e_i = \iint_{Q_e} f\, \psi_{e,i}\, dx\,dy $$    (6.11)
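
The sums (6.8) and (6.10) can be sketched in Python as below. The sketch does not form the matrices $M^e$ explicitly; multiplying by $M^{eT}$ and $M^e$ only selects rows and columns, so the same result is obtained by adding each elemental entry at the position given by glob. The function name `assemble` and the tiny two-element mesh with two nodes per element are hypothetical and serve only as an illustration.

```python
def assemble(n, glob, H_el, b_el):
    """Scatter-add form of H = sum_e (M^e)^T H^e M^e and b = sum_e (M^e)^T b^e.

    glob[e][i] is the 1-based global label of local node i of element e.
    """
    H = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for labels, He, be in zip(glob, H_el, b_el):
        for i, gi in enumerate(labels):
            b[gi - 1] += be[i]
            for j, gj in enumerate(labels):
                H[gi - 1][gj - 1] += He[i][j]
    return H, b


# Illustrative mesh: two elements with two nodes each, sharing global node 2.
glob = [[1, 2], [2, 3]]
H_el = [[[1.0, -1.0], [-1.0, 1.0]]] * 2
b_el = [[1.0, 1.0]] * 2
H, b = assemble(3, glob, H_el, b_el)
```

Contributions of the two elements overlap only at the shared node, which is what produces the banded structure of $H$ for a suitable node numbering.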


The linear system of equations (6.7) may be solved either by a direct
method or by an iterative scheme. Direct solution techniques are based on
the decomposition of the positive definite symmetric matrix $H$ into two or
more matrices that have nice properties, and then transforming the system
(6.7) into a number of simpler systems. For example, assuming that $H$ is
decomposed into the product of a lower and an upper triangular matrix, $L$
and $U$, respectively, the solution $u$ may be obtained by first solving the
lower triangular linear system $Ly = b$ and then using its solution $y$ to
solve the upper triangular system $Uu = y$. The solution of the two
triangular systems is relatively simple, and hence, most of the work involved
with the solution of (6.7) is in the factorization $H = LU$. An advantage of
the direct solution is that the factorization of $H$ does not have to be
repeated if we desire to solve $Hu = b'$ for a different right side vector
$b' \ne b$, which is the case in some problems where the finite element
analysis is to be performed for different load functions.
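
A minimal Python sketch of this direct approach is given below. It uses a Doolittle factorization without pivoting, which is an assumption made for brevity (it is adequate for the small positive definite example used here), and the function names and the 3 by 3 matrix are illustrative only.

```python
def lu_factor(H):
    """Doolittle factorization H = L U (no pivoting; assumes nonzero pivots)."""
    n = len(H)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):          # row i of U
            U[i][j] = H[i][j] - sum(L[i][p] * U[p][j] for p in range(i))
        for j in range(i + 1, n):      # column i of L
            L[j][i] = (H[j][i] - sum(L[j][p] * U[p][i] for p in range(i))) / U[i][i]
    return L, U


def lu_solve(L, U, b):
    """Solve L y = b by forward substitution, then U u = y by back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][p] * y[p] for p in range(i))
    u = [0.0] * n
    for i in reversed(range(n)):
        u[i] = (y[i] - sum(U[i][p] * u[p] for p in range(i + 1, n))) / U[i][i]
    return u


H = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
L, U = lu_factor(H)                    # factor once
u1 = lu_solve(L, U, [1.0, 0.0, 1.0])   # first load vector
u2 = lu_solve(L, U, [2.0, 0.0, 2.0])   # second load vector, same factors
```

The second call reuses `L` and `U`, which is the saving described above when several load vectors are to be processed.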

Alternatively, iterative solvers start by assuming an initial guess $u^0$ to
the solution $u^*$, followed by the application of an iterative scheme for
obtaining successive approximations $u^1, u^2, \ldots$ of $u^*$. The
convergence of the iterates $u^1, u^2, \ldots$ to the solution $u^*$ depends
on both the initial guess $u^0$ and on the procedure used to derive $u^i$ from
$u^{i-1}$. In Section 9.2, we will consider iterative solvers in more detail.
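
As a simple illustration of such a scheme, the following Python sketch applies the Jacobi iteration. The choice of Jacobi is an assumption made here for illustration and is not necessarily one of the solvers discussed in Section 9.2; the matrix and right side are the same illustrative values used nowhere else in this document.

```python
def jacobi(H, b, u0, iterations):
    """Derive u^i from u^(i-1) by solving the i-th equation for its
    diagonal unknown, using the previous iterate for all other unknowns."""
    n = len(b)
    u = list(u0)
    for _ in range(iterations):
        u = [(b[i] - sum(H[i][j] * u[j] for j in range(n) if j != i)) / H[i][i]
             for i in range(n)]
    return u


H = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]
u = jacobi(H, b, [0.0, 0.0, 0.0], 200)   # initial guess u^0 = 0
```

For this particular matrix the iterates approach the exact solution (1, 1, 1); for other matrices the iteration may converge slowly or not at all.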

In summary, the linear finite element analysis involves essentially the
following four computational steps: 1) generation of the finite element mesh,
2) generation of an elemental stiffness matrix $H^e$ and an elemental load
vector $b^e$ for each finite element $e$, $e = 1, \ldots, m$, 3) assembly of
the global stiffness matrix $H$ and load vector $b$, and 4) solution of the
linear system of equations $H u = b$.

Finally, we note that the solution vector $u$ of (6.7) defines the function
$\varphi$ only at the nodes $i = 1, \ldots, n$. Given $u$, the value of
$\varphi$ at any other point $(x,y) \in Q$ may be obtained from the
interpolation formula (6.6.a).

Remark 1:

In the previous discussion, we assumed that $\varphi$ is a real-valued
function. However, the finite element analysis is also applicable if
$\varphi$ is a mapping into $R^d$, that is, a function with $d > 0$ degrees
of freedom. In this case, the coefficients $a_{r,l}(x,y)$ and the load
$f(x,y)$ in the variational formulation (6.1)/(6.3) become $d \times d$
matrices and $d$-dimensional vectors, respectively. But the basic finite
element technique remains the same and all the above formulas are valid with
the following interpretations:

1) Each component $u_i$ of the vector $u$ is a $d$-dimensional subvector
that contains the values of the $d$ components of $\varphi$ at node $i$.

2) Each entry $H^e_{i,j}$ in the elemental matrix $H^e$ is a $d \times d$
submatrix, and each entry $b^e_i$ in the elemental load vector $b^e$ is a
$d$-dimensional subvector.

3) The entries of the $M^e$ matrices of (6.4) are $d \times d$ unit matrices
or zero matrices instead of ones and zeroes, respectively. Hence, the
order of the linear system of equations (6.7) increases from $n$ to $nd$.

Remark 2:

In some problems, it is natural to choose the function space
$\mathcal{H} = \mathcal{H}_0$ such that any function $v \in \mathcal{H}_0$ is
equal to zero on a specified part $\partial Q_0 \subset \partial Q$ of the
boundary $\partial Q$. Then, the basis functions $\psi_{e,i}$ associated with
the nodes $(e,i) \in \partial Q_0$ should be excluded from the expansion
(6.6). Although this decreases the dimension of the elemental arrays $H^e$ and
$b^e$ for the elements that have common boundaries with $\partial Q_0$, it
has the disadvantage of causing nonuniformity in the computation of the
different elements. A common method for retaining the uniformity of the
computation is to ignore this condition and to include all the basis
functions in (6.6). Then for each node $(e,i) \in \partial Q_0$, the entries
of $H^e$ and $b^e$ are changed such that $b^e_i = 0$, $H^e_{i,j} = 0$ for
$j \ne i$, and $H^e_{i,i} = 1$. This is equivalent to replacing the $i$th
equation in the linear system (6.7) by the equation $u_i = 0$, which
guarantees that the solution is a member of the space $\mathcal{H}_0$.


7. A SYSTOLIC SYSTEM FOR THE GENERATION OF THE ELEMENTAL
ARRAYS.


The purpose of the systolic system presented in this chapter is to
generate the elemental arrays $H^e$ and $b^e$, $e = 1, \ldots, m$, for a
given finite element problem and a specific mesh on its domain $Q$. In order
to simplify the design and the description of the system, we assume, as in
Chapter 6, that all elements are of the same type, and hence that the number
$k$ of nodes per element is the same for all of them.


In most practical problems, the coefficients $a_{r,l}$, $r,l = 0,1,2$, in the
bilinear operator (6.3.a) are constants or slowly changing functions. Hence
it is very common to approximate these coefficients by piece-wise constant
functions on each element, in which case we may rewrite the formula (6.9) for
$\bar{H}^e_{i,j}$ as

    $$ \bar{H}^e_{i,j} = \sum_{r,l=0}^{2} a^e_{r,l} \iint_{Q_e} D_r \psi_{e,i}\, D_l \psi_{e,j}\, dx\,dy $$    (7.1.a)

where $a^e_{r,l}$ are constants on the element $e$. Similarly, the load
function $f(x,y)$ in the functional (6.3.b) may be approximated by a
piece-wise constant function and hence, we may rewrite (6.11) as

    $$ \bar{b}^e_i = f^e \iint_{Q_e} \psi_{e,i}\, dx\,dy $$    (7.1.b)

where each $f^e$ is a constant on the element $e$. This approximation,
however, may not be suitable for some applications, and sometimes it is more
appropriate to approximate $f$ by a spline function in the same space as the
solution function $\varphi$. In this case we use the same basis functions as
in (6.6) to approximate

    $$ f(x,y) = \sum_{j=1}^{k} f_{e,j}\, \psi_{e,j}(x,y) $$

where $f_{e,j}$ is the value of the load $f(x,y)$ at node $(e,j)$. With this,
(6.11) may be rewritten as

    $$ \bar{b}^e_i = \sum_{j=1}^{k} f_{e,j} \iint_{Q_e} \psi_{e,i}\, \psi_{e,j}\, dx\,dy $$    (7.1.c)

In order to evaluate the integrals (7.1.a/b/c), an isoparametric
transformation [57] is used (see for example Figure 7.1) to map the domain of
each element $Q_e$ onto a standard element $\bar{Q}$ on some 2-dimensional
space $(\bar{x},\bar{y})$, namely

    $$ x = \sum_{i=1}^{k} \bar{\psi}_i(\bar{x},\bar{y})\, x_i^e $$    (7.2.a)

    $$ y = \sum_{i=1}^{k} \bar{\psi}_i(\bar{x},\bar{y})\, y_i^e $$    (7.2.b)

where $\bar{\psi}_i(\bar{x},\bar{y}) = \psi_{e,i}(x(\bar{x},\bar{y}), y(\bar{x},\bar{y}))$,
$i = 1, \ldots, k$, are the basis functions in the new space
$(\bar{x},\bar{y})$, and $(x_i^e, y_i^e)$, $i = 1, \ldots, k$, are the
coordinates of the $k$ nodes in the finite element $e$.

Figure 7.1 - An isoparametric transformation

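The integrals are then evaluated numerically over $\bar{Q}$ instead of $Q_e$.
Without entering into the mathematical details, we give only the final
formulas used to calculate $\bar{H}^e_{i,j}$ and $\bar{b}^e_i$:

    $$ \bar{H}^e_{i,j} = \sum_{r,l=0}^{2} a^e_{r,l} \sum_{g=1}^{q} w_g\, D_r \psi_{e,i}(x_g,y_g)\, D_l \psi_{e,j}(x_g,y_g)\, det^e(\bar{x}_g,\bar{y}_g) $$    (7.3.a)

    $$ \bar{b}^e_i = f^e \sum_{g=1}^{q} w_g\, \bar{\psi}_i(\bar{x}_g,\bar{y}_g)\, det^e(\bar{x}_g,\bar{y}_g) $$    (7.3.b)

    $$ \bar{b}^e_i = \sum_{j=1}^{k} f_{e,j} \sum_{g=1}^{q} w_g\, \bar{\psi}_i(\bar{x}_g,\bar{y}_g)\, \bar{\psi}_j(\bar{x}_g,\bar{y}_g)\, det^e(\bar{x}_g,\bar{y}_g) $$    (7.3.c)

where $q$ is the order of the quadrature rule used in the numerical
integration, $(\bar{x}_g,\bar{y}_g)$, $g = 1, \ldots, q$, are the quadrature
points with weights $w_g$, and $det^e(\bar{x},\bar{y})$ is the determinant of
the Jacobian matrix $J^e$ of the transformation $Q_e \to \bar{Q}$. From (7.2),
this Jacobian is found to be

    $$ J^e = \begin{bmatrix} J^e_{1,1} & J^e_{1,2} \\ J^e_{2,1} & J^e_{2,2} \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{k} \bar{D}_1 \bar{\psi}_i(\bar{x},\bar{y})\, x_i^e & \sum_{i=1}^{k} \bar{D}_2 \bar{\psi}_i(\bar{x},\bar{y})\, x_i^e \\ \sum_{i=1}^{k} \bar{D}_1 \bar{\psi}_i(\bar{x},\bar{y})\, y_i^e & \sum_{i=1}^{k} \bar{D}_2 \bar{\psi}_i(\bar{x},\bar{y})\, y_i^e \end{bmatrix} $$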


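Because of the regularity of the standard element $\bar{Q}$, we can easily
find the formulas for $\bar{\psi}_i(\bar{x},\bar{y})$ and its derivatives
$\bar{D}_1 \bar{\psi}_i = \partial \bar{\psi}_i / \partial \bar{x}$ and
$\bar{D}_2 \bar{\psi}_i = \partial \bar{\psi}_i / \partial \bar{y}$.
Then the derivatives $D_r \psi_{e,i}$, $r = 1,2$ and $i = 1, \ldots, k$, used
in (7.3.a) may be obtained from the transformation

    $$ \begin{bmatrix} D_1 \psi_{e,i} \\ D_2 \psi_{e,i} \end{bmatrix} = [J^e]^{-T} \begin{bmatrix} \bar{D}_1 \bar{\psi}_i \\ \bar{D}_2 \bar{\psi}_i \end{bmatrix} $$    (7.4)

where $[J^e]^{-T}$ is the inverse of the transposed Jacobian matrix
$[J^e]^T$. This inverse is explicitly assumed to exist.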

It should be noted that the quadrature points and weights as well as
the basis functions $\bar{\psi}_i$ and their derivatives
$\bar{D}_1 \bar{\psi}_i = \partial \bar{\psi}_i / \partial \bar{x}$ and
$\bar{D}_2 \bar{\psi}_i = \partial \bar{\psi}_i / \partial \bar{y}$ do
not depend on the specific finite element that is to be processed. Hence,
they may be computed at the quadrature points $(\bar{x}_g,\bar{y}_g)$ and
pre-loaded into the system before it starts its operation, which allows for
their repeated use during the calculations of $H^e$ and $b^e$ for
$e = 1, \ldots, m$. On the other hand, the derivatives $D_1 \psi_{e,i}$ and
$D_2 \psi_{e,i}$ in (7.3.a) have to be calculated by (7.4) for
each element.
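
The quadrature formula for the stiffness entries, together with the Jacobian and the transformation (7.4), can be illustrated by the following Python sketch. It is written under several assumptions that are not taken from this document: a four nodes quadrilateral element, bilinear basis functions on the standard square $[-1,1]^2$, and the 2 by 2 Gauss rule with unit weights. The function names are hypothetical, and the code follows the formulas directly rather than the systolic partitioning of the computation.

```python
import math

# Corner signs of the standard square [-1, 1]^2 (assumed standard element).
CORNERS = [(-1, -1), (1, -1), (1, 1), (-1, 1)]


def standard_basis(xb, yb):
    """Values and standard-space derivatives of the bilinear basis functions."""
    psi = [0.25 * (1 + a * xb) * (1 + c * yb) for a, c in CORNERS]
    d1 = [0.25 * a * (1 + c * yb) for a, c in CORNERS]   # d/d(x-bar)
    d2 = [0.25 * c * (1 + a * xb) for a, c in CORNERS]   # d/d(y-bar)
    return psi, d1, d2


def elemental_stiffness(nodes, a):
    """Quadrature approximation of the elemental stiffness matrix.

    nodes: list of (x_i, y_i) for the k = 4 element nodes.
    a:     3x3 list of constant coefficients a[r][l] on the element.
    """
    k = len(nodes)
    p = 1.0 / math.sqrt(3.0)
    points = [(sx * p, sy * p, 1.0) for sx in (-1, 1) for sy in (-1, 1)]
    H = [[0.0] * k for _ in range(k)]
    for xb, yb, w in points:
        psi, d1, d2 = standard_basis(xb, yb)
        # Jacobian entries: rows correspond to x and y, columns to x-bar and y-bar.
        J11 = sum(d1[i] * nodes[i][0] for i in range(k))
        J12 = sum(d2[i] * nodes[i][0] for i in range(k))
        J21 = sum(d1[i] * nodes[i][1] for i in range(k))
        J22 = sum(d2[i] * nodes[i][1] for i in range(k))
        det = J11 * J22 - J12 * J21
        # Derivatives in (x, y) from the inverse transposed Jacobian, as in (7.4).
        D = [psi,
             [(J22 * d1[i] - J21 * d2[i]) / det for i in range(k)],
             [(-J12 * d1[i] + J11 * d2[i]) / det for i in range(k)]]
        for i in range(k):
            for j in range(k):
                H[i][j] += w * det * sum(a[r][l] * D[r][i] * D[l][j]
                                         for r in range(3) for l in range(3))
    return H


# Unit square element with a[1][1] = a[2][2] = 1 (Laplace operator).
laplace = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H = elemental_stiffness([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)], laplace)
```

Note that the quantities returned by `standard_basis` depend only on the quadrature point and not on the element, so they could be tabulated once, which mirrors the pre-loading described above.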

The following algorithm, ALG5, computes the elemental stiffness matrices
$H^e$ for $e = 1, \ldots, m$. (The steps N1 through N5 in the algorithm are
partitioned in a manner needed for the description of our systolic system.)
In this algorithm, we denote by $\nu_i^0(g)$ the value of the basis function
$\bar{\psi}_i(\bar{x}_g,\bar{y}_g)$ and by $\lambda_i^r(g)$ and $\nu_i^r(g)$,
$r = 1,2$, the values of its derivatives
$\bar{D}_r \bar{\psi}_i(\bar{x}_g,\bar{y}_g)$ and
$D_r \psi_{e,i}(x_g,y_g)$, respectively.

Algorithm  ALG5 
INPUTS 

1)  (  v°(g),  (g).  A;2(g)).  g  =  l.-*-.g  and  /  = 

2)  For  each  finite  element  e=1,*,,,m 

2.1)  (x?  ,y0 ) ,  /  =1.  -  -  -  ,/c 

»  'i 

2.2)  a0  r  r./=0,l,2  /*  note  that  a®  ;=  a0  f  V 

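For each finite element $e=1,\ldots,m$ DO

N1) For each quadrature point $g=1,\ldots,q$ compute the Jacobian of the isoparametric transformation from

$$\begin{bmatrix} J^e_{1,1}(g) & J^e_{2,1}(g) \\ J^e_{1,2}(g) & J^e_{2,2}(g) \end{bmatrix} = \begin{bmatrix} \hat\psi_1^1(g) & \cdots & \hat\psi_k^1(g) \\ \hat\psi_1^2(g) & \cdots & \hat\psi_k^2(g) \end{bmatrix} \begin{bmatrix} x_1^e & y_1^e \\ \vdots & \vdots \\ x_k^e & y_k^e \end{bmatrix}$$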

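N2) For $g=1,\ldots,q$ compute the temporary quantities

$$\begin{bmatrix} \tau_1^1(g) & \cdots & \tau_k^1(g) \\ \tau_1^2(g) & \cdots & \tau_k^2(g) \end{bmatrix} = \begin{bmatrix} J^e_{2,2}(g) & -J^e_{2,1}(g) \\ -J^e_{1,2}(g) & J^e_{1,1}(g) \end{bmatrix} \begin{bmatrix} \hat\psi_1^1(g) & \cdots & \hat\psi_k^1(g) \\ \hat\psi_1^2(g) & \cdots & \hat\psi_k^2(g) \end{bmatrix}$$

N3) For $g=1,\ldots,q$ DO

N3.1) $\det^e(g) = J^e_{1,1}(g)\,J^e_{2,2}(g) - J^e_{1,2}(g)\,J^e_{2,1}(g)$

N3.2) $\psi_i^r(g) = (1/\det^e(g))\,\tau_i^r(g)$, $r=1,2$, $i=1,\ldots,k$

N3.3) $\bar\psi_i^r(g) = \det^e(g)\,\psi_i^r(g)$, $r=0,1,2$, $i=1,\ldots,k$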

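N4) For $i=1,\ldots,k$ compute the approximate integrals

N4.1) For $j=1,\ldots,i-1$

$$Y^{r,l}_{i,j} = \sum_{g=1}^{q} \psi_i^r(g)\,\bar\psi_j^l(g), \qquad r,l=0,1,2$$

N4.2) $$Y^{r,l}_{i,i} = \sum_{g=1}^{q} \psi_i^r(g)\,\bar\psi_i^l(g), \qquad r=0,1,2,\quad l=0,\ldots,r$$

N5) For $i=1,\ldots,k$ DO

N5.1) For $j=1,\ldots,i-1$

$$H^e_{i,j} = \sum_{r=0}^{2}\sum_{l=0}^{2} a^e_{r,l}\,Y^{r,l}_{i,j}$$

N5.2) $$H^e_{i,i} = 2\sum_{r=0}^{2}\sum_{l=0}^{r} c_{r,l}\,a^e_{r,l}\,Y^{r,l}_{i,i}$$

where $c_{r,l}$ equals 1 if $r \neq l$, and 0.5 if $r=l$.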
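
The steps N1 through N5 can be traced serially with a short Python sketch. This is only an illustration under stated assumptions, not the systolic implementation: the function name `elemental_stiffness`, the nested-list data layout, the choice $\psi_i^0 = \hat\psi_i^0$, and quadrature weights equal to 1 are all assumptions made here.

```python
def elemental_stiffness(psi_hat, coords, a):
    """Serial sketch of steps N1-N5 of ALG5 for one element.

    psi_hat[g][r][i]: r=0 basis value, r=1,2 reference derivatives of basis
                      function i at quadrature point g (assumed layout).
    coords[i] = (x_i, y_i): node coordinates of the element.
    a[r][l]: symmetric 3x3 coefficient table.
    Quadrature weights are assumed to be 1 (an assumption of this sketch).
    """
    q, k = len(psi_hat), len(coords)
    psi, psi_bar = [], []
    for g in range(q):
        p0, p1, p2 = psi_hat[g]
        # N1: Jacobian entries at quadrature point g
        j11 = sum(p1[i] * coords[i][0] for i in range(k))
        j21 = sum(p1[i] * coords[i][1] for i in range(k))
        j12 = sum(p2[i] * coords[i][0] for i in range(k))
        j22 = sum(p2[i] * coords[i][1] for i in range(k))
        # N2: temporary quantities (numerators of the physical derivatives)
        t1 = [j22 * p1[i] - j21 * p2[i] for i in range(k)]
        t2 = [j11 * p2[i] - j12 * p1[i] for i in range(k)]
        # N3: determinant, physical derivatives, determinant-scaled copies
        det = j11 * j22 - j12 * j21
        rows = [list(p0), [t / det for t in t1], [t / det for t in t2]]
        psi.append(rows)
        psi_bar.append([[det * v for v in row] for row in rows])
    H = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            # N4: approximate integrals Y[r][l] for the pair (i, j)
            Y = [[sum(psi[g][r][i] * psi_bar[g][l][j] for g in range(q))
                  for l in range(3)] for r in range(3)]
            if j < i:
                # N5.1: full double sum for off-diagonal entries
                h = sum(a[r][l] * Y[r][l] for r in range(3) for l in range(3))
            else:
                # N5.2: only l <= r is used, with weight 0.5 on r == l
                h = 2.0 * sum((0.5 if r == l else 1.0) * a[r][l] * Y[r][l]
                              for r in range(3) for l in range(r + 1))
            H[i][j] = H[j][i] = h
    return H


# Linear triangle, one quadrature point, Laplace-type coefficients.
psi_hat = [[[1 / 3, 1 / 3, 1 / 3], [-1.0, 1.0, 0.0], [-1.0, 0.0, 1.0]]]
a = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H_unit = elemental_stiffness(psi_hat, [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], a)
H_big = elemental_stiffness(psi_hat, [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)], a)
```

In this small example the rows of the resulting matrix sum to zero and the matrix is unchanged when the triangle is uniformly scaled, as expected for a pure second-order operator in two dimensions.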

The calculation of the load vectors $b^e$, $e=1,\ldots,m$, may be included in ALG5 as an additional step. This step depends on whether we will use (7.3.b) or (7.3.c) for computing $b^e$. More precisely, if (7.3.b) is used then

N6) For $i=1,\ldots,k$ DO

$$b_i^e = f^e \sum_{g=1}^{q} \bar\psi_i^0(g)$$

On the other hand, if (7.3.c) is used, then

N6) For $i=1,\ldots,k$ DO

$$b_i^e = \sum_{j=1}^{k} f_j^e\,Y^{0,0}_{i,j}$$

Figure 7.2 shows a block diagram of the systolic system that executes this algorithm. It consists of a local memory LM to store the pre-loaded values of $\hat\psi_i^0(g)$, $\hat\psi_i^1(g)$ and $\hat\psi_i^2(g)$, and six systolic subnetworks N1 ... N6 that are arranged in a cascade such that the output of a subnetwork is an input for a following subnetwork. Each subnetwork is designed to perform the computation in the corresponding step of ALG5.


Figure  7.2  -  A  general  block  diagram  of  the  system. 

In order to compute the matrix $H^e$ for a certain element e, the coordinates of the nodes $(x_i^e, y_i^e)$, $i=1,\ldots,k$, and the coefficients $a^e_{r,l}$, $r,l=0,1,2$, for that element are fed to the system via the subnetworks N1 and N5, respectively. The entries $H^e_{i,j}$, $i=1,\ldots,k$, $j=1,\ldots,i$, of the symmetric matrix $H^e$ are then obtained from the subnetwork N5 after a delay period of (q+3k+16) time units, where a time unit is the maximum time needed by any computational cell in the system to perform its operation. This is basically the time required to perform a Multiply/Add operation or a division, whichever is larger. The subnetwork N6 is used to compute the vector $b^e$.

The system described in this chapter provides a speedup of order qk over the serial execution of algorithm ALG5. However, the real advantage of the system lies in the possibility of pipelining the computations of the stiffness matrices for $e=1,\ldots,m$, and of obtaining one matrix every 3k time units. Of course, we also obtain the advantage of a non-conflicting and smooth data flow in the system, which greatly reduces the memory fetch times.
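
The two timing figures quoted above (an initial delay of q+3k+16 time units and one matrix every 3k time units) can be combined into a rough schedule. The following Python sketch assumes that the first element enters at time 0 and that elements are numbered from 1; the function names and the non-pipelined comparison (each element waits for the previous one to leave) are illustrative assumptions.

```python
def initial_delay(q, k):
    """Delay before the first entries of a stiffness matrix appear."""
    return q + 3 * k + 16


def pipelined_start(e, q, k):
    """Time at which output for element e (1-based) begins when a new
    element is accepted every 3k time units (assumed schedule)."""
    return initial_delay(q, k) + (e - 1) * 3 * k


def unpipelined_start(e, q, k):
    """Same quantity if each element had to wait for the full delay of
    its predecessor (hypothetical baseline for comparison)."""
    return e * initial_delay(q, k)


# Example: k = 3 nodes per element, q = 3 quadrature points.
first = pipelined_start(1, 3, 3)
tenth = pipelined_start(10, 3, 3)
tenth_serial = unpipelined_start(10, 3, 3)
```

For many elements the pipelined schedule is dominated by the 3k term, so the fixed latency q+3k+16 is paid only once.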

In Sections 7.1 through 7.6, we describe the architecture of the six subnetworks N1 ... N6 that execute the corresponding steps in algorithm ALG5. Moreover, we will derive the I/O description of the individual subnetworks and prove that the system generates an elemental stiffness matrix and the corresponding load vector if appropriate input data are provided. Then, in Section 7.7, we show how one can use the technique described in Section 5.4 to prove that the suggested system can be pipelined for the computation of all the elemental arrays.

It should be clear that alternate designs for the components of the system may be given. However, one advantage of the system described in this chapter is its flexibility, in the sense that only minor modifications are needed in order to use the system for different values of k (element type) and q (quadrature formula). Moreover, our primary goal is to show the applicability of the systolic approach to the generation of the elemental arrays, and to demonstrate the effectiveness of the formal model for a precise specification and verification of systolic networks with computational cells more complicated than those of the simple Multiply/Add type.

7.1. The Subnetwork N1.

The graph of the systolic network N1 is composed of 2q interior nodes as shown in Figure 7.3(a); each node is labeled by two integers (i,g), i=1,2


where s=1 for i=2 and s=3 for i=1, and


X  =  4®+/+1**' 3  [p  *  /  1 

/.<?  lP/.g  S.g1 


(7.6.b) 


The graph in Figure 7.3(a) and equations (7.5), (7.6) specify N1 completely. In order to analyze the internal structure of each cell (i,g) more closely, we first note that equations (7.6.a/b) indicate that a cell should contain a multiplier and two accumulators (see Figure 7.3(b)). The accumulators start operating at times g+i and g+i+1, respectively. They accumulate the output of the multiplier every third time unit and reset themselves to zero every 3k time units. The content of these accumulators at consecutive time units is expressed by the sequences $\lambda_{i,g}$ and $\bar\lambda_{i,g}$. As is clear from equation (7.5.c), each cell also contains a multiplexer that starts operating at time g+i-1 and multiplexes the input $\pi_{i,g}$ and the contents of the accumulators with time ratios 3k-2:1:1. The delay element $\Omega^s$ is introduced in Figure 7.3(b) under the assumption that the elements A and M do not consume any time. In practical implementations, however, these elements do consume some time, and consequently the element labeled $\Omega^s$ has the function of a synchronizer rather than a latch.
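
The notion of multiplexing several streams "with time ratios 3k-2:1:1" can be illustrated by a small Python sketch. It is not the formal multiplexer operator of the model, whose definition is given elsewhere in the report; the function name `multiplex`, the zero-based time axis, and the use of `None` for don't-care slots are assumptions made here.

```python
def multiplex(ratios, streams, start, length):
    """Cyclic time multiplexer (illustrative sketch).

    From time `start` on, forward streams[0] for ratios[0] time units,
    then streams[1] for ratios[1] units, and so on, repeating with
    period sum(ratios). Before `start` the output is a don't care.
    """
    period = sum(ratios)
    # Phase within the period -> index of the stream that is selected.
    select = [idx for idx, r in enumerate(ratios) for _ in range(r)]
    out = []
    for t in range(length):
        if t < start:
            out.append(None)
        else:
            out.append(streams[select[(t - start) % period]][t])
    return out


# k = 2 gives the ratios (3k-2, 1, 1) = (4, 1, 1) with period 3k = 6.
k = 2
n = 8
pi_in = ["p%d" % t for t in range(n)]   # pass-through input
acc_a = ["a%d" % t for t in range(n)]   # first accumulator content
acc_b = ["b%d" % t for t in range(n)]   # second accumulator content
out = multiplex((3 * k - 2, 1, 1), (pi_in, acc_a, acc_b), start=1, length=n)
```

Each period of 3k time units thus carries 3k-2 pass-through items followed by one item from each accumulator.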

After  having  described  the  architecture  of  the  network,  we  prove  the 
following  proposition  about  the  I/O  description  for  N1.  It  is  an  explicit  rela¬ 
tion  between  the  network  output  sequences  p3  .  g=1.**’g.  and  the 

network  input  sequences  tt.|  .  p1  .  /  =  1.2,  g=l,***g. 

Proposition N1.1 : I/O description of the network N1. For $g=1,\ldots,q$, the following relations hold:

p3.g  '  n  pl.g 

~  w3k-4.1. 1,1.1  ,^3  ,  t  _3, 

V5,g  =  n  Mg  +3  <fl  *l.g  '  X2,g  '  X2.g  '  n  X1.g 

where 


<«'"  p,.g  * 


(7. 7. a) 
(7.7.b) 

<7.7.0 
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r  _  =  Ag«^.k. 3  /- 1 


nff_1  </.lJ 


(7.7.d) 


Proof:  To  prove  (7.7.b).  we  first  note  that  (7.5.a/b)  have  the  solutions 

*  n“"'  l7ea> 

0/.„  -  n  »i.a 

Then  from  (7.5.c)  we  obtain  for  p=l.***,<7  that 

*3.0  =  n  Mptl  ’  *  <*2.p  •  ^2,0  •  X2.0} 

-  r\  *j3k ~2. 1 , 1  ._3  ..3k-2.hl  ,  ,  -  „  -  v 

n  M0tl  <**  M0  <*1.B,X1.0,X1.0>  '  X2.0  '  K2.0} 

where  X.  and  X^  are  as  given  in  (7,7.c)  and  (7.7.d),  respectively.  With 
Property  P5  from  Appendix  A  this  may  be  rewritten  as 

n3.g  "  n  C'U  '  n\.9  •  •  >>2.9  •  \>.9> 

Finally, we obtain (7.7.b) by applying property P13 from Appendix A. Equation (7.7.a) results directly from (7.8.b). ∎

In order to perform the calculations in step N1 of ALG5 for a certain finite element e, $1 \le e \le m$, the input sequences must be described by


(7. 9. a) 

C,  ’  =  n,_1  E ®  [92  i=1.2  <7 . 9. b> 

p1.0  =  ’  n02Vl  •  n202,p0.2))  =  (79c) 


where 


$$T(\xi_i^e) = T(\varphi_{g,0}) = T(\varphi_{g,1}) = T(\varphi_{g,2}) = k \qquad (7.9.d)$$

$$\varphi_{g,0}(t) = \hat\psi_t^0(g), \qquad \varphi_{g,1}(t) = \hat\psi_t^1(g), \qquad \varphi_{g,2}(t) = \hat\psi_t^2(g)$$

$$\xi_i^e(t) = \begin{cases} x_t^e & \text{if } i=1 \\ y_t^e & \text{if } i=2 \end{cases}$$

In other words, $\xi_1^e$ and $\xi_2^e$ contain the coordinates of the nodes of the element e, and $\varphi_{g,0}$, $\varphi_{g,1}$ and $\varphi_{g,2}$ contain the shape functions and their derivatives. A pictorial representation of these input sequences in the case k=3 and q=3 is provided in Figure 7.11 using a time diagram of the elements of the different sequences at consecutive time units.

Proposition  N1.2  :  With  the  inputs  (7.9).  the  outputs  of  the  network  N1  are 
described  by 

2  -  -2-2 


3,0f  =  nff+1  P\k^YA^\.o  ■  ne\.  1  •  2»  9*1. *  •  * <q  (7. 10. a) 

_  +3k  1  „e  _ ^  a 


3,g 


/S' 


g*l.  •  •  •  .q  (7.10.b) 


where $T(\beta^e) = 4$ and $\beta^e(t) = J^e_{2,1}(g)$, $J^e_{2,2}(g)$, $J^e_{1,1}(g)$ and $J^e_{1,2}(g)$ for t = 1, 2, 3 and 4, respectively.


Proof : The proof of (7.10.a) follows directly from (7.7.a). To prove (7.10.b), we first note that the operator $P_2^{3k}$ in (7.9.c) indicates that the first 3k elements of the argument are repeated twice in $\rho_{1,g}$. This repetition is only necessary for the operation of the subnetwork N2, and will not be considered here. Hence, we will replace the last 3k elements of the repetition by don't care elements, which reduces (7.9.c) to


pl.g  =  n0_1  W!''U(e2Vo  • 


(7.9.e) 


Now  substitution  of  the  input  sequences  (7.9.a/b/e)  into  the  I/O  descrip¬ 
tion  (7.7.b)  results  in 

3. 


„  ..3k -4. 1,1, 1,1  *  %  -  .3.  _3-r-  , 

n3.g  =  n  Mgt 3  0  '  X2,g  '  X2.g  '  n  Xl,g  '  n  Xl.g)' 


(7.11.  a) 


Here, by (7.7.c), the definition of the E operator, and the properties P1 and P7, we find that


/>2*-3  «]',,,<e2i»e_0*t*i  .  ne2i*9 _,.«?!  . 


and.  by  P14.  that 


*.  \ 


-  ri**'-1  »’*'3  e2  „  . 


Similarly,  we  can  show  that 


1  nBtl  A'-*'3  92  l*s.2  ■  <fl 


(7.n.b) 


(7.11.C) 


For a further simplification of equation (7.11.a), we consider the definition of the multiplexer operator with the restrictions (7.9.d) for the involved sequences. This gives, for $g=1,\ldots,q$,

H.,  -  o'*3*"  9' 

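where $T(\beta^e) = 4$ and

$$\beta^e(t) = \begin{cases} \lambda_{2,g}(g+3k-1) & \text{for } t=1 \\ \bar\lambda_{2,g}(g+3k) & \text{for } t=2 \\ \lambda_{1,g}(g+3k-2) & \text{for } t=3 \\ \bar\lambda_{1,g}(g+3k-1) & \text{for } t=4 \end{cases}$$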


Moreover,  from  (7.11.b).  P1 1 .  and  the  definitions  of  the  shift  and  the  spread 
operators,  we  obtain  that 

X2g(g+3*-l)  =  tnff+1  92  A1'**1  [<pg  1  -  i9  ]]<g+3*-l) 

=  CA 1  ' 1  [*pg  -j  '  l\  11(A) 

*  e  *  ■)  eg 

=  E  ,</>  «2(/)  =  E  &  <0>  n<<7> 

/ =1  /=1  '  ' 


where $J^e_{2,1}(g)$ is specified in algorithm ALG5.

By a similar argument, it can be shown that $\beta^e(2)$, $\beta^e(3)$ and $\beta^e(4)$ are equal to $J^e_{2,2}(g)$, $J^e_{1,1}(g)$ and $J^e_{1,2}(g)$, respectively, which proves the proposition and shows that the network successfully performs the calculations in step N1 of ALG5 for one finite element e. ∎


(7. 12. a) 


For cells (4,g)

c 

77-  =  n  77, 

6.0  4.0 

P5.0  =  n  P< 

"5.8  - 

“>4.8 

*  [ 
0 1 1  0  t4 

L  • 

For  ceils  (5.g) 

°7.8  *  "  "«»Y 

<P5.8  • 

"5.8  ■  "5.81 

u 

4.0 


OtYO-V,!  •  ^*3  »«.,  0'>'  <7,2t» 


(7.12.0 

From the above specifications, it is clear that cells (3,g) and (4,g), $1 \le g \le q$, have identical structure (see Figure 7.4(b)) and differ only in the reset times of their accumulators, multiplexers and memories. To reset these elements at the proper time, an external reset signal can be propagated in the network as explained in Section 5.2.

Proposition  N2.1  :  The  network  I/O  description  of  N2  is  given  by 

0  =  1.  •  •  •  .q 


”6.8  *  "4  "3.8 

<>7.8  *  0  "JtY  “>*  <>3.8  •  * 


3.0  *  X4.0) 


(7. 13. a) 


0=1.-  --.q  (7.13.b) 


where 
X 


3.8  *  "2  “>3.8  ’  £8*3  "3.s’  '  "  ">3.8  *  ^2  "3.,1 


(7.13.0 


X4.0  “  n  tn  P3.0 


^.3  nS.8>  -  n2  [n  <>3.8 


*J*t4  n\.g'  l7,3a) 


Proof : Equation (7.13.a) is trivial. In order to prove (7.13.b), we begin by applying property P1.3 to equation (7.12.a):


°5.8  ■  "  A"’3''  3.8  ‘  "3.8 


_3k 


[-17-  J  .  0  ) 


3.0  0  +2  1  "3,gJ 

a 

Then,  we  apply  property  P14  and  use  C  to  replace  sequences  whose 
values  are  irrelevant  to  our  analysis.  This  gives 

zt*.  -  ff,  J  -  [p. 


O-  =  n  m!*1,1  (0 
5,0  0 


•  n  tp3,0  *  ^g  +3  "3,0J 
which  from  P5  may  be  written  as 


3.0 


^72  "3.81  '  6’’ 


,.1,1,1  ' 

°5.0  =  M0tl  (0 


X3.g  •  0  > 


(7. 14. a) 


where  X3  is  described  by  (7.13.C).  Similarly  from  <7.12.b)  we  obtain 


u5.g  *  "Vl  'o.o.  *4.gJ  (7.14.b) 

where $\lambda_{4,g}$ is described in (7.13.d). Finally, substituting (7.14.a/b) in (7.12.c), and using P13, we obtain (7.13.b), which completes the proof. ∎

The input links of N2 are directly connected to the outputs of N1, and hence the input sequences $\pi_{3,g}$ and $\rho_{3,g}$, $g=1,\ldots,q$, are described by the formulas (7.10).


Proposition N2.2 : If the inputs to N2 are given by (7.10), then its outputs may be described by
_  rt0+3**3 


er 


n6  a  =  n 

9  ,  rtg+3*t4  „1.1.1  to2, 


g  =  l.  •  •  •  .q  (7. 15. a) 


'7.g 


=  n" 


M\""lQ‘*g.O  '  ne‘fg.  1  •  n20%.2)  a  (7.15.b) 


where $T(\gamma_{g,1}) = T(\gamma_{g,2}) = k$ and $\gamma_{g,1}(t) = \tau_t^1(g)$, $\gamma_{g,2}(t) = \tau_t^2(g)$, with $\tau_t^1(g)$ and $\tau_t^2(g)$ as specified in algorithm ALG5.


Proof : The proof of (7.15.a) is trivial. In order to prove (7.15.b), we will ignore the value of the first 3k+g+1 elements in the input $\rho_{3,g}$ and hence rewrite (7.10.a) as


3.(7 


=  n0+3*+1  M]  h\e\  0  .  ne\  }  .  n2e2^  2) 


(7.10. c) 


In order to find the output sequences $\rho_{7,g}$, we obtain an explicit description for $\lambda_{3,g}$ and $\lambda_{4,g}$ by substituting the input sequences into (7.13.c/d). Indeed, from (7.10.b/c) it follows that


p3.g  *  Eg+3  *3.g 


-  ng+3‘fl  ^'1J«S.o  •  <»S.i  ' 


»  p3k  g  +3/c-l  e 
g  t3  n  0 


We  then  interchange  the  shift  and  expand  operators  using  P6  and  apply  PI 8 
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p3.e  '  ^‘t3  "3.S  -  n8*3*'1  in2  <®2Vo  .  ne2v,  .  nV^  y 

•  fj*  f\  1 

*  n9  *  1  in3n  1  “J1’'1  <e2ra  0  .  ne2»a  ,  .  n2e2«a  2)  . 

-  nef3*t’  .  ne2  .  v,,  . 

where,  as  usual,  the  sequences  irrelevant  In  this  context  were  replaced  by 
* 

0  .  Similarly,  we  obtain 

P3.0  *  ^t2  *3.0  =  n°f3ktl  .  0*  .  nV  [J®  l(g)  .  ^  2], 

and  thus  from  (7.13.0  that 

>3.„  -  n0+3k*3  A.]-’-1,,'  .  ne2  wj.*,  .  -  .  »*,  - 

n®  +3*  +2  Mi.  .  (0*  0  n2e2[j®  ^0)  .  ^  2d 

*  n9+3/f+3  [m]'1j(o  ,  ne2  ^|2(0)  .  <p  ,i  ,  0*)  - 

<0  .  ne  uj1(0)  .  *g  2j  .  0  )] 

=  n0+3/c+3  ne2  £,y®  2  (ff )  .  ^  J  -  J® ^  (g)  ,  <pg  2]  .  0*) 

By  a  similar  analysis  it  follows  that 

-  fi*+3*+3  rfioPr.fi  .  n 


Mi'  (0  ■  0  ■  n  0  •  <eg  2  ~  2(g)  ■  *0  iIJ 


Finally,  we  substitute  into  (7.13.0)  the  computed  values  for  \  and 

3,0 

X4.0  t09ether  w,,h  the  inPut  sequence  p3  and  apply  properties  P5  and 
Pi 3  to  obtain 

p7.0  *  n°+3k +3  ^11'1,1»S.O  '  n0S.l  '  n202’r0.2)  = 

where 

$$\gamma_{g,1}(t) = J^e_{2,2}(g)\,\varphi_{g,1}(t) - J^e_{2,1}(g)\,\varphi_{g,2}(t) = J^e_{2,2}(g)\,\hat\psi_t^1(g) - J^e_{2,1}(g)\,\hat\psi_t^2(g) = \tau_t^1(g)$$

$$\gamma_{g,2}(t) = J^e_{1,1}(g)\,\varphi_{g,2}(t) - J^e_{1,2}(g)\,\varphi_{g,1}(t) = \tau_t^2(g)$$

This proves explicitly that the output sequences $\rho_{7,g}$ contain the results of Step N2 in ALG5. ∎

7.3. The Subnetwork N3.

As in the case of N2, the subnetwork N3 is composed of q independent, identical rows, namely, one for each point used in the numerical integration (7.3). Each row performs the calculation corresponding to step N3 in ALG5 for a certain value of g, $1 \le g \le q$. In Figure 7.5, we show the graph for the g-th row of N3. The function of the cell (6,g) is to compute the determinant of the isoparametric transformation as in step N3.1 of algorithm ALG5. Its operation may be formally described by:


=  w*.g  *  n  m\g  '  »e.„  -  °\.g  •  "Vo* 

The  cells  (7,g)  and  (8,g)  perform  the  computations  in  steps  N3.2  and 
N3.3  of  ALG5.  respectively.  Their  operation  may  be  described  by 

P8.g  ~  n  Mg+3 k+7  (fl  p7,g  '  p7.g  *  °7.g]) 
p9.g  =  n  p8,g 

n9.g  =  n  l"g  *  *8.g  *  p8.g] 


With this description and the inputs (7.15), it is easy to obtain the output on the links $p_{9,g}$ and $r_{9,g}$, $g=1,\ldots,q$, namely


ngt3k+8  (Q2 

ngt3k+8 


2  2  — 

n  6  vg  .2! 
2  2 

n  e  vg  .2* 


(7. 16. a) 
(7.16.b) 


where $\nu_{g,r}(t) = \psi_t^r(g)$, $\bar\nu_{g,r}(t) = \bar\psi_t^r(g)$, and the values of $\psi_t^r(g)$ and $\bar\psi_t^r(g)$ are as given in step N3 of ALG5.

7.4. The Subnetwork N4.

In this subsection, we describe a network that completes the numerical integration by computing the quantities $Y^{r,l}_{i,j} = \sum_{g=1}^{q} \psi_i^r(g)\,\bar\psi_j^l(g)$ for the ranges of the indices in the corresponding step of ALG5. The subnetwork is described by the graph in Figure 7.6. The node I/O descriptions of a typical interior node (i,g), $i=9,\ldots,8+3k$, $g=1,\ldots,q$, are given by


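$$\pi_{i+1,g} = \Omega^2\,\pi_{i,g} \qquad (7.17.a)$$

$$\rho_{i+1,g} = \Omega\,\rho_{i,g} \qquad (7.17.b)$$

$$\zeta_{i,g+1} = \Omega\,[\zeta_{i,g} + \pi_{i,g}\cdot\rho_{i,g}] \qquad (7.17.c)$$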

As this description shows, each cell latches the p and r data streams by two and one time units, respectively. It also performs a Multiply/Add operation and puts the result on the z output link.
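
The cumulative effect of this latching along a row can be sketched in Python. Sequences are modeled as lists indexed by time, with `None` as a don't-care slot; the helper names `shift` and `column_streams` are hypothetical and only illustrate that passing through i-9 cells delays the p stream by 2(i-9) and the r stream by (i-9) time units.

```python
def shift(seq, n):
    """Delay a sequence by n time units; None marks a don't-care slot."""
    return [None] * n + list(seq)


def column_streams(p9, r9, i):
    """Streams seen at column i >= 9 after passing i-9 cells, each of
    which latches the p stream by 2 and the r stream by 1 time unit."""
    p, r = list(p9), list(r9)
    for _ in range(i - 9):
        p = shift(p, 2)
        r = shift(r, 1)
    return p, r


p11, r11 = column_streams([1, 2], [3, 4], 11)
```

Because the two streams move at different speeds, every item of one stream eventually meets every item of the other, which is what allows all index pairs to be multiplied.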

Proposition N4.1 : The I/O description for N4 is given by

$$\zeta_{i,q+1} = \sum_{g=1}^{q} \Omega^{i-8+q-g}\,[\Omega^{i-9}\pi_{9,g}\cdot\rho_{9,g}], \qquad i=9,\ldots,8+3k \qquad (7.18)$$

Proof : To prove this proposition, we first write the solutions of (7.17.a) and (7.17.b) in the form

$$\pi_{i,g} = \Omega^{2(i-9)}\,\pi_{9,g}, \qquad i=9,\ldots,8+3k,\quad g=1,\ldots,q$$

$$\rho_{i,g} = \Omega^{(i-9)}\,\rho_{9,g}, \qquad i=9,\ldots,8+3k,\quad g=1,\ldots,q$$

and then substitute them into (7.17.c). This gives

$$\zeta_{i,g+1} = \Omega\,[\zeta_{i,g} + \Omega^{2(i-9)}\pi_{9,g}\cdot\Omega^{i-9}\rho_{9,g}] \qquad (7.19)$$

By Lemma 1 in Appendix A, the solution of (7.19) for a fixed i, $9 \le i \le 8+3k$, is then found to be identical with equation (7.18). This completes the proof.

... $0 \le u \le k-1$ and $0 \le v \le 2$. More descriptively, we divide the 3k columns of N4 into k groups of 3 columns each. Thus, we rewrite the network description (7.20) as

$$\zeta_{u,v,q+1} = \sum_{g=1}^{q} \Omega^{3u+v+1+q-g}\,[\Omega^{3u+v}\pi_{9,g}\cdot\rho_{9,g}] \qquad (7.21)$$
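
The regrouping of the 3k columns into k groups of 3 amounts to writing each column index as i = 9 + 3u + v. A minimal Python sketch of this index map follows; the function names are illustrative and not part of the original notation.

```python
def column_to_group(i):
    """Map a column index i = 9 + 3u + v to the pair (u, v),
    where u is the group number and v the position in the group."""
    if i < 9:
        raise ValueError("column indices start at 9")
    return divmod(i - 9, 3)


def group_to_column(u, v):
    """Inverse map: (u, v) with 0 <= v <= 2 back to the column index."""
    return 9 + 3 * u + v


# For k = 3 the columns 9..17 fall into groups u = 0, 1, 2.
k = 3
groups = [column_to_group(i) for i in range(9, 9 + 3 * k)]
```

With this map, the exponent i-8 in (7.18) becomes 3u+v+1 and i-9 becomes 3u+v.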


Proposition  N4.2  :  With  the  inputs  described  by  (7.16).  the  network  N4  has 
the  following  output: 


‘u  .v.qt  1 


_2(3utv)tw  — 

=  n  7? 


u  ,v 


u=0. 


,/c-l  and  v=0.1.2 


(7.22) 


with 


where  □  is  a  modulo  3  addition.  w=q+3k+9,  and  we  have  for  0</<2. 


r.l  I  k~u  "  ' 

<V>  H 

l  k-u- 1  /7  r 


/f  r  </ 
•>/ 


and 


/•' 

‘  t+u 


-»'«>  ■  i 

'  A  A 


t  .f+u+1 


if  rC/ 
if  r  >1 


Proof  :  Using  the  input  sequences  (7.16)  in  (7.21)  we  obtain 


-  m’  -1-1  <e2vo  „  .  ne2>y,  .  nV^y  > 


<e‘Vo-  “V,- 


*  *  “!  ,  ,<9S.o-  <*>%.,•  "29S.2>' 


(7.23) 


where  again.  w=q+3kt9  and  v  g  Is  found  by  properties  P4  and  P5  to  be 
equal  to 


ku  .v.g 


Mj-'-’reV  vg  0  .  ne2n"  vg  ,  .  n2e2n"  yy 
M!  '  '  (e2n"+’  y2  .  ne2n"  y  0  .  n2e2n"  y  y 


if  v=0 
if  v  =  1 


m]'1*1  (e2nu+1  v 


g.l  ■  V2  ’  n292n“  Vo’  "  ^ 


The result (7.22) is then obtained by first applying P1 to perform the multiplication in (7.23), then by pulling $\Omega^{3u+v}$ out of the M operator with the help of P5 and by applying the summation to the arguments of M (property P1.2). As an illustration of the derivation procedure, we consider the case v=1 for which we have


^u.l.q+1  =  n 


3u+1+w  £  -.1.1.1  ,_2rrtt/+l  - 

M,  (6  tn  ygA  *  Vg  Q]  . 
ne2m"  vga  •  »g_,i  .  AV  ,  -  V2» 


=  n3ut,t*  «V"  „20  .  na2n“  .  n2e2n"  „>2> 

2  0  0  1  1  2 

where,  from  PI.  7 (t?u  )  =  k-u- 1.  7(7JU‘  )  =  Tin  )  -  k-u  and 

^'0"’ =  J,  ‘vj"’  ■  vo<'t",,‘i  *  e|,  *?<*>  *?„♦,<«»  - 

’’S  '"’  =  E  t?  «)  ■  »,««,>!  =  E  [vfca)  >  Y?jtll 
0  =  1  0=1 


,<;-2tt)  =  Et  (»„.,«)  ■  -j.jB™*)  -  E  !»,’<»>  b,2.„<9>'  • 


Finally, we apply P5 to get

$$\zeta_{u,1,q+1} = \Omega^{2(3u+1)+w}\,M_1^{1,1,1}(\theta^2\eta_u^{0,1},\ \Omega\theta^2\eta_u^{1,2},\ \Omega^2\theta^2\eta_u^{2,0})$$

which is a special case of (7.22) for v=1. The cases v=0 and v=2 are proved in an exactly similar way. ∎

Equation (7.22) shows that the output sequences $\zeta_{u,v,q+1}$ contain the results of the numerical integration needed for the calculation of the stiffness matrices in the next subnetwork N5. It also specifies precisely the time of each output data item. In Figure 7.7, this specification is translated into a time diagram, where we plot the elements of $\zeta_{u,v,q+1}$ versus time for the special case of k=3 and q=3.


Figure 7.7 - The output of N4.

7.5. The Subnetwork N5.

The network N5 is composed of three different rows (see Figure 7.8). Row q+1 contains 3k identical nodes. It receives the constants $a^e_{r,l}$ on the links $p_{9,q+1}$, $r_{9,q+1}$ and $s_{9,q+1}$ and distributes them appropriately on the b-colored links such that each integral $Y^{r,l}_{i,j}$ appearing on a z-colored link meets the corresponding constant $a^e_{r,l}$ at the right time. Row q+2 also contains 3k identical nodes and computes the partial sums $u^r_{i,j} = \sum_{l=0}^{2} a^e_{r,l}\,Y^{r,l}_{i,j}$ and $\sum_{l} c_{r,l}\,a^e_{r,l}\,Y^{r,l}_{i,j}$ for $i \neq j$ and $i=j$, respectively, where $c_{r,l}$ is as given in ALG5. Finally, row q+3 contains only k nodes that complete the sum $H^e_{i,j} = \sum_{r=0}^{2} u^r_{i,j}$. The edges of the graph are given the colors p, r, s, b, z, $z^0$, $z^1$ or $z^2$ as shown in Figure 7.8. Note that we used three different colors $z^0$, $z^1$ and $z^2$ to satisfy the restriction that no two edges ending at a node have the same color. To simplify the analysis, we consider each of the three rows separately.

Figure 7.8 - The graph for N5.

We consider first the row q+1 in which each cell simply latches the four data streams z, p, r and s by one time unit, and selects the output on the b link to be

$$\beta_{i,q+2} = \begin{cases} \Omega\,[h_i\,\pi_{i,q+1}] & \text{if } i=9+3u,\quad u=0,\ldots,k-1 \\ \Omega\,\rho_{i,q+1} & \text{if } i=9+3u+1,\quad u=0,\ldots,k-1 \\ \Omega\,\sigma_{i,q+1} & \text{if } i=9+3u+2,\quad u=0,\ldots,k-1 \end{cases}$$

where $h_i = 0.5$ for i=9 and $h_i = 1.0$ for i>9. The factor 0.5 is needed to implement step N5.2 in ALG5, where only the $Y^{r,l}_{j,j}$, $l \le r$, are explicitly available for the computation of $H^e_{j,j}$, while we have $Y^{r,l}_{j,j} = Y^{l,r}_{j,j}$ for $l > r$.
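
The role of the factor 0.5 can be checked with a short derivation. It assumes only the symmetries stated in the text, namely $a^e_{r,l} = a^e_{l,r}$ and $Y^{r,l}_{j,j} = Y^{l,r}_{j,j}$, and the weights $c_{r,l}$ of step N5.2 ($c_{r,l} = 0.5$ for $r=l$ and 1 otherwise).

```latex
\sum_{r=0}^{2}\sum_{l=0}^{2} a^e_{r,l}\,Y^{r,l}_{j,j}
  = \sum_{r=0}^{2} a^e_{r,r}\,Y^{r,r}_{j,j}
    + \sum_{r=0}^{2}\sum_{l<r}\bigl(a^e_{r,l}\,Y^{r,l}_{j,j} + a^e_{l,r}\,Y^{l,r}_{j,j}\bigr)
  = \sum_{r=0}^{2} a^e_{r,r}\,Y^{r,r}_{j,j}
    + 2\sum_{r=0}^{2}\sum_{l<r} a^e_{r,l}\,Y^{r,l}_{j,j}
  = 2\sum_{r=0}^{2}\sum_{l=0}^{r} c_{r,l}\,a^e_{r,l}\,Y^{r,l}_{j,j}.
```

Hence halving the terms with $r=l$ and doubling the whole sum reproduces the full double sum using only the integrals with $l \le r$.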


For  the  proper  operation  of  the  system  the  input  sequences  should  be 
described  by 

”9.,  + 1  "  n"  '•‘“S’ 

<Wl  *  <“l>  <724> 

_w  03  .  e. 

a9.„+>  =  "  W 

where  for  /=0,1.2.  T (a®)  =  3  and  a®(f)  =  a^_iQ2y  f_i  •  witl1  D  denoting, 
once again, the modulo 3 addition operation. More descriptively, we input on each line three of the constants $a^e_{r,l}$, $r,l=0,1,2$, repeated k times as indicated by the piping operator $P^3_k$ (for more details see Figure 7.11).

Using the two indices $0 \le u \le k-1$ and $0 \le v \le 2$ as in the previous subsection, and noting that the input $\zeta_{u,v,q+1}$ is given by (7.22), we can easily show that


^ u  y  .q+2 
^ u  y  .q+2 


J2(5u  +v)+w  +  l  - 

n  vu  v 

n3u+vtwtl  p3(  ^ ^ 


(7.25ia) 

(7.25.0) 


where $h_{u,v} = 0.5$ if u=v=0 and $h_{u,v} = 1.0$ otherwise.

Figure 7.9 - A typical cell in row q+2 of N5

The 3k cells (i,q+2), $9 \le i \le 8+3k$, in row q+2 have basically the same structure; each is a multiplier/adder equipped with a demultiplexer that distributes the results to the output links $p_{i+1,q+2}$ and $z^v_{u,q+3}$ (see Figure 7.9, where u and v equal the quotient and the remainder of (i-9)/3, respectively). Formally, the operation of each cell (i,q+2) is described by


P/  +  l.<7+2  =  n  M/-9+w  +  l(t  '  P/,q+2  +  V 
,q  +3  =  n  M/-9+w  +  l  (p/.q+2  +  \  '  ° 


(7. 26. a) 
(7.26.b) 


where $\lambda_i = \zeta_{i,q+2}\cdot\beta_{i,q+2}$ and the input $\pi_{9,q+2}$ is permanently set to the zero sequence. For a description of the outputs $\zeta^v_{u,q+3}$, we solve (7.26) using Lemma 3 in Appendix A. This yields


n  m 


(\j  .  L ) 

i  =9 

an2xf_1  +  x.]  .  o 

/  =  10 

(  ia4xy_2  +  n2x/_1  + 

M  . 

i) 

/  =  11. 

where,  by  (7.26)  and  (7.25),  X^  =  *3</+y+9  is  9*ven  &Y 

\  =  . '  <<V,  -V  ’  n3""  V,' 


(7.27) 


^eotvtwti  ro3  ;:  v  - 

-  n  £P.  (/i  .a  )  *  fl  7)  I 

k  ~u  u  ,v  /  9  fj  y1 

Using the definition of $\bar\eta_{u,v}$ from Proposition N4.2 and with the help of property P18.2 we rewrite $\lambda_i$ in the form

\  •  ti6"*' n*  <e2,^  .  ne2d^'D'  .  nV,2'“.  <7.2a> 

where,  for  r=0.1.2  , 


»'/,'Dr  1  *  „  «*«»Or>*l>  .  Vr  v°r  =  ft  a*  „  „f 

u  u  .v  v  'u  u  y  r  yDr  'u 

that  is  =  T( i)^1)  and 

r,/..,  _  .  e  r.l 
Mu  (f)  =  ,v  ar ,/  *u 


r  ,vO r 


(7.29) 


Proposition N5.1 : With the input described by (7.22) and (7.24), the intermediate sequences $\zeta^v_{u,q+3}$, $u=0,\ldots,k-1$, $v=0,1,2$, are given by

6utv+w+l  M1.2(n3Q2u2,2  +  ^2.^  +  ^2.0^  fl*}  fQr  y=0 

Cqt 3  *|n6WtV+W  +  1  M;-2(n3e2[^’2  +  M^1  +  .  6*)  for  v  =  l  (7.30) 

(_n6«tvtw  +  1  M1.2(n3e2u0.2  +  M0.1  +  ,0.0,  0'>  {Qr  v=2 

where we extended the definition of $\mu^{r,l}_u$ such that $\mu^{r,l}_{-1}$ equals the zero sequence.

Proof : For the case i>11, we first use P5 to rewrite (7.28) in the detailed form

n6„t»  1.1  ^2.2  n92#0,0  „2 

for  v=0 

I 

\  -j 

J  M|-'.'<n392a>.2  n4e2fl2.0 

nV/1, 

U 

for  v  =  l 

n8ut»»2  1.1  m3e2)l0.2  n4e2a1.0 

n5©2^2'1) 

u 

CM 

11 

> 

h- 

O 

Then, for the evaluation of $\lambda_{i-1} = \lambda_{3u'+v'+9}$, we note that $0 \le v' \le 2$ and hence i-1 = 3u+v+8 should be written in the form 3(u-1)+2+9, 3u+0+9 and 3u+1+9 for v=0, 1 and 2, respectively. With these forms for i-1 in (7.28) and the help of P5 we get


n6"~  .  ne2»Jf,  .  nV^2) 

n2x/H  .  n6"*'"'  mV.”  .  nV»2-2  .  nV*™) 

I  n6U,»,2  MU.t(n3e2ll0..  nV#.,2  n5e2B2.0) 

Similarly,  we  write  i-2  as  3(u-l)+l+9.  3(u-l)+2+9  and  3u+0+9 
and  2.  respectively  and  get 


for  v=0 
for  v  =  l 
for  v=2 
for  v=0,l 


i 

.  n6"t"'  «;  ,1<nV,2  »  . 

2  0  1 

08  »«-i  • 

2  2  12 
n  e  *„  -7 

for 

O 

II 

> 

II 

CM 

1 

/< 

* 

C 

1  6u  tw  +  1 

1  0 

mV.;* 

•  nVu2’ 

■  «*•*.«> 

for 

V=1 

( 

.  ^Gu+w+2 

nVuJ-2. 

for 

CM 

II 

> 

4  2 

Then by adding these three formulas to get $\Omega^4\lambda_{i-2} + \Omega^2\lambda_{i-1} + \lambda_i$, and by substituting the result in (7.27), we directly obtain the equation (7.30) for $1 \le u \le k-1$.

The case u=0, that is i=9, 10 and 11, can be analyzed in an exactly similar manner yielding the result


fl' 


*fj'2<n392  («2'2J  .  o') 

for 

o 

II 

> 

<2«v»;2  t  «;•’) .  o-» 

for 

V  =  1 

«!  2<nv  » *;•’  t  d°„°i  • 

for 

CM 

II 

> 

which, by defining $\mu^{r,l}_{-1}$ to be the zero sequence, may also be put into the form (7.30). ∎

Finally, each group of three sequences $\zeta^0_{u,q+3}$, $\zeta^1_{u,q+3}$ and $\zeta^2_{u,q+3}$ is considered as input to a cell (u,q+3), $0 \le u \le k-1$, in row q+3 of N5. The operation of a typical cell in row q+3 is formally expressed by


C  —  fife  ^6u+w +5.3.1  >2  .. 

u.qf 4  cu  '  oo+w+5^u,q+3  '  ^u.q+3  '  ^u,q+3J 


.1 


■  mc«  •  »'•  <n2«2.,t3  * 


where $c_u$ equals 2 for u=0 and 1 otherwise.

By substituting the sequences (7.30) into the network description (7.31) we easily find the description of the output sequences as


<u  ,q+  4 

where 


eu+w+7  2  J* 
u 


u=0.  -  -  *  ,/c -1  (7.32) 


2 

E 

r  =0 


+ 


2 

E 

r=1 


r-1 

E 

1=0 


2  2 
E  E 

r=0  /=r 


/f  u  >0 


if  u  =0 


Using the definition of $\mu^{r,l}_u$ from (7.28/22) in (7.32) and comparing the result with step N5 in algorithm ALG5, we readily prove the following proposition:

Proposition N5.2 : If the inputs to the network N5 are given by (7.22) and (7.24), then the network's output sequences are given by


C 


u 


,q+  4 


6 u+w+7 


0 


u  =0.  •  •  •  ,/c  -1  (7.32) 


where  T <<Z^)  =  k-u  and  M^ct)  =  • 

Proposition N5.2 states that after an initial time period of 6u+3k+q+16 units, each output link $z_{u,q+4}$ will carry the elements of the u-th off-diagonal of the elemental stiffness matrix $H^e$, separated from each other by 2 time units (see Figure 7.11).

7.6. The Subnetwork N6.

The purpose of N6 is to generate, for each finite element e, the entries $b_i^e$, $i=1,\ldots,k$, in the elemental load vector $b^e$. The design of the subnetwork depends on which of the two variants of step N6 of ALG5 is applied to generate $b_i^e$: the one based on (7.3.b) or the one based on (7.3.c). In the following, we consider each case separately.


7.6.1.  Realization  of  step  N6  in  ALG5. 

In order to compute $b^e$ by step N6, the values of the load $f^e$ at the current element e should be supplied from outside the system. On the other hand, the quantities $\bar\psi_i^0(g)$, $g=1,\ldots,q$, $i=1,\ldots,k$, are readily available on the output links $p_{3k+9,g}$, $g=1,\ldots,q$, of N4. More precisely, replacing non-relevant information in (7.16.a) by don't cares, we may write the sequences on the links $p_{3k+9,g}$, $g=1,\ldots,q$, as

$$\pi_{3k+9,g} = \Omega^{6k}\,\pi_{9,g} = \Omega^{g+9k+8}\,\theta^2\,\bar\nu_{g,0}, \qquad g=1,\ldots,q \qquad (7.33)$$

where $T(\bar\nu_{g,0}) = k$ and $\bar\nu_{g,0}(t) = \bar\psi_t^0(g)$.

Figure  7.10  shows  a  realization  of  N6  by  a  systolic  network  that  is  to 
be  connected  in  cascade  with  N4.  It  is  composed  of  q+l  computational 
cells  whose  operations  are  described  by 


^+9.0  +  1  =  n  Mg  ([C3kt9.0  f  ff3k+9.0J  '  0  } 

1,2  —  * 

** 3k +9.qt2  =  0  M0+l  (lC3k+9.0+l  *  n3k+9.q  + 11  '  5  1 


(7. 34. a) 

(7.34.b) 


where the colors of the links are as shown in Figure 7.10. Note that the cells in N6 are either simple adders or simple multipliers that operate once every three time units.


Figure 7.10 - The graph for N6.

The input links $p_{3k+9,g}$, $g=1,\ldots,q$, are connected to the corresponding output links of N4. Moreover, we set the link $z_{3k+9,1}$ permanently to zero and supply the load $f^e$ through $p_{3k+9,q+1}$, that is


^3Ar+9.1  =  1 

-  _a t9k  +9  2  -e 

n3k+9.q  +  l  ~  n  9  * 


(7.35) 


where  T(i^)  =  k  and  jftt)  =  fe .  for  any  t*k.  With  these  Inputs,  it  is 
straight  forward  to  prove  that  the  output  on  the  link  *3ki.g  f2  Is  described 
by 


/  =  n<7+9*+1°  e2  H9 

C3k+9.g+2  "  9  Mk 


(7. 36. a) 


where T(\eta^e) = k and \eta^e(t) = b_t^e. That is, N6 does indeed generate the
elements of the load vector b^e. From equations (7.32) and (7.36.a), it is
clear that the elements of H^e and b^e are generated from N5 and N6,
respectively, at the same rate. The initial delay on the individual output
links may be changed, if we wish, by adding the appropriate number of
latch cells. For example, by adding a cell that delays the sequence in
(7.36.a) by six time units and produces the output on a further z link,
we obtain an output sequence that has the same form as the sequences
in (7.32), namely

    \Omega^{q+9k+16} \Theta^{2} \eta^e        (7.36.b)


7.6.2.  Realization of step N6 in ALG5.

One can look at step N6 of ALG5 as a matrix-vector multiplication
b^e = Y F, where the entries of the vector F = { f_j } are the
values of the load function at the nodes of e, and Y = { Y^{0,0}_{i,j}, i,j=1,...,k }
is a symmetric matrix. Fortunately, the values of Y^{0,0} are available on the
output links z_{u,0,q+1}, u=0,...,k-1 of N4. More precisely, from (7.22) we
have

    \zeta_{u,0,q+1} = \Omega^{6u+w} \Theta^{2} \eta_u^{0,0} ,    u=0,1,...,k-1

where T(\eta_u^{0,0}) = k-u and \eta_u^{0,0}(t) = Y^{0,0}_{t,t+u}.

This form of the matrix Y enables us to use the matrix-vector
multiplication array of Section 3.1 to compute the components of the product
vector b^e = { b_i^e, i=1,...,k } at a rate compatible with that of the generation
of the elements of H^e.
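
To make the diagonal-wise data layout concrete, the following Python sketch computes the product of a symmetric matrix with a vector when only the diagonals u=0,...,k-1 (each of length k-u) are supplied, which is the form in which the entries of Y appear on the z links. It is a sequential illustration only, not a model of the systolic array of Section 3.1; the function name and the 0-based indexing are assumptions of the sketch.

```python
def symmetric_matvec_by_diagonals(diags, f):
    """Compute b = Y f for a symmetric k x k matrix Y given by its diagonals.

    diags[u][t] = Y[t][t+u] for u = 0..k-1 and t = 0..k-u-1 (0-based).
    """
    k = len(f)
    b = [0.0] * k
    for u, diag in enumerate(diags):
        for t, y in enumerate(diag):
            b[t] += y * f[t + u]          # entry on or above the main diagonal
            if u > 0:
                b[t + u] += y * f[t]      # mirrored entry, by symmetry
    return b
```

For Y = [[2, 1], [1, 3]] the diagonals are `[[2, 3], [1]]`; each off-diagonal entry is read once and used twice.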

To summarize the behavior of the integrated system presented in this
chapter, we show in Figure 7.11 a time diagram of the data on all the
input and output links of the global system of Figure 7.2. It represents a
translation of the sequence equations (7.9), (7.24), (7.32), (7.35) and (7.36.b)
for  the  special  case  k=q=3.  The  data  items  in  the  input  sequences  (2- 

\sigma_{9,q+1}, \rho_{9,q+1}, \alpha_{9,q+1} and \pi_{3k+9,q+1} depend on the finite element that is

being  processed  and  hence  must  be  provided  from  outside  the  system.  On 

the  other  hand,  the  data  in  p. (  g=l.***.q  do  not  depend  on  a  particular 
finite  element  and  thus  are  provided  from  a  memory  local  to  the  system. 

Note that we assumed that the network of Section 7.6.1 is used for the
generation of the elemental load vectors.


In general, the time for completing the computations for one finite
element is 9k+q+10 time units. In the next section, we will show that the
computation for different finite elements can be pipelined through the system
and that the elemental arrays can be generated at a rate of one stiffness
matrix/load vector every 3k time units.


7.7.  Verification  of  Pipelined  Operation. 

From the previous description of the subnetworks N1 through N6, it
should be easy to check that all computations are inert in the sense
defined in Section 5.4, and hence that the m computations corresponding to
the m finite elements can always be pipelined through the subnetworks.
Moreover, the techniques of that section may be used to prove that we can
achieve the maximum pipeline efficiency by taking the pipe separation r=3k,
which is the maximum span involved in the computation corresponding to
one finite element.

The proof procedure is basically the same for the six subnetworks;
hence, we will only demonstrate its application to one subnetwork, namely
N4. Recall that in Section 7.4 the network I/O description of N4 was found
to be given by equation (7.20), which is

    \zeta_{l,q+1} = \sum_{g=1}^{q} \Omega^{l-8+q-g} [ \Omega^{l-9} \sigma_{9,g} \cdot \pi_{9,g} ]        (7.37)

Moreover, when the inputs for a certain instance C^e of the computation are

    \sigma^e_{9,g} = \Omega^{g+3k+8} \mu_g^e ,    g=1,...,q        (7.38.a)

    \pi^e_{9,g} = \Omega^{g+3k+8} \nu_g^e ,    g=1,...,q        (7.38.b)

then the outputs are given by

    \zeta^e_{l,q+1} = \Omega^{2l+3k+q-9} \eta_l^e ,    l=9,...,3k+8        (7.38.c)

where the detailed forms of the sequences \mu_g^e and \nu_g^e, containing the input
data for C^e, and the sequences \eta_l^e, containing the results of C^e, are
specified by (7.16) and (7.22), respectively. For the following discussion, we do
not need these detailed forms; it suffices to know that T(\mu_g^e) = T(\nu_g^e) = 3k
and T(\eta_l^e) = 3k-(l-9), and hence that the minimum pipe separation is r_m =
3k.

If the computations for the different elements are pipelined through N4
with a separation of 3k, then the inputs should have the form

    \sigma_{9,g} = \Omega^{g+3k+8} P^{3k}_{e=1,m} ( \mu_g^e )
                 = \Omega^{g+3k+8} P^{3k}_{e=1,m} ( \Omega^{-(g+3k+8)} \sigma^e_{9,g} )

and

    \pi_{9,g} = \Omega^{g+3k+8} P^{3k}_{e=1,m} ( \nu_g^e )
              = \Omega^{g+3k+8} P^{3k}_{e=1,m} ( \Omega^{-(g+3k+8)} \pi^e_{9,g} )

where we followed the notation developed in Section 5.4. Using this in the
network I/O description (7.37), we get the pipeline outputs in the form

    \zeta_{l,q+1} = \sum_{g=1}^{q} \Omega^{l-8+q-g} [ \Omega^{l-9} \Omega^{g+3k+8} P^{3k}_{e=1,m} ( \Omega^{-(g+3k+8)} \sigma^e_{9,g} )
                                                      \cdot \Omega^{g+3k+8} P^{3k}_{e=1,m} ( \Omega^{-(g+3k+8)} \pi^e_{9,g} ) ]

Now, by properties P1 and P8 in the Appendix we obtain

    \zeta_{l,q+1} = \Omega^{l+3k+q} P^{3k}_{e=1,m} ( \Omega^{-(l+3k+q)} \sum_{g=1}^{q} \Omega^{l-8+q-g} [ \Omega^{l-9} \sigma^e_{9,g} \cdot \pi^e_{9,g} ] )

which by (7.37) and (7.38.c) reduces to

    \zeta_{l,q+1} = \Omega^{l+3k+q} P^{3k}_{e=1,m} ( \Omega^{l-9} \eta_l^e )        (7.39)

Finally, because of T(\Omega^{l-9} \eta_l^e) = l-9+T(\eta_l^e) = 3k, we use P8 to write
(7.39) as

    \zeta_{l,q+1} = \Omega^{2l+3k+q-9} P^{3k}_{e=1,m} ( \eta_l^e )

which proves that the sets of results { \eta_l^e; l=9,...,3k+8 } of the different
instances e=1,...,m will be correctly produced at the rate of 1/(3k) if the sets
of inputs { \mu_g^e, \nu_g^e; g=1,...,q } are pumped through N4 at the same rate.


Similarly, we can prove that the other subnetworks can operate
successfully at maximum pipeline efficiency. Given that the output of the
computations associated with a specific element e is described by (7.32) and
(7.36.b), it is then easy to verify that the result of pipelining the
computations associated with the m elements, e=1,...,m, is described by

    \zeta_{u,q+4} = \Omega^{6u+w+7} P^{3k}_{e=1,m} ( \Theta^{2} \eta_u^e )        (7.40)

where \eta_u^e is as described in (7.32) and (7.36.b) with the superscript e used
to designate the results of the computations associated with the element e.
Clearly, (7.40) shows that the system may compute all the elemental arrays
in a time equal to t_c + 3k(m-1), where t_c = 9k+q+10 is the time consumed by
the first instance of the computation, that is, the time for the generation of
H^1 and b^1.
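
The timing formulas above are easy to tabulate. The short Python sketch below evaluates t_c = 9k+q+10 and the pipelined total t_c + 3k(m-1); the non-pipelined figure m*t_c is added only for comparison and assumes that elements would otherwise be processed strictly one after another. The function names are hypothetical.

```python
def first_element_time(k, q):
    """Time t_c = 9k + q + 10 consumed by the first instance of the computation."""
    return 9 * k + q + 10


def pipelined_time(k, q, m):
    """Total time t_c + 3k(m-1) for m elements with pipe separation 3k."""
    return first_element_time(k, q) + 3 * k * (m - 1)


def sequential_time(k, q, m):
    """Comparison only: m elements processed one after another, no overlap."""
    return m * first_element_time(k, q)
```

For the special case k=q=3 of Figure 7.11, `first_element_time(3, 3)` gives 40 time units, and each further element adds 9 time units in pipelined operation.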

Remark  1: 


A careful examination of the system presented in this chapter shows
that the time of the computation of the m elemental matrices and m
elemental vectors may be reduced to t'_c + (2k+1)(m-1) for some special
problems in which the coefficients a^e_{r,l} are equal to zero for r=0 or l=0.
Examples of this important class of problems are the heat flow, the plane
strain and plane stress problems [57]. To obtain this time reduction, some
control parameters have to be changed as well as the forms of the input
sequences. At this point we note that with the technique of Section 5.4, it
can be proved that a successful pipelining of the operation on the modified
network requires a pipe separation r equal to 2k+1. This is larger than the
minimum pipe separation r_m = 2k for the computation, which means that the
modified network cannot operate at maximum pipeline efficiency.

Remark  2: 

It may sometimes be desirable to slow down the rate at which the
output is generated. More specifically, we may require that the output
sequences have the forms

    \zeta_{u,q+4} = \Omega^{c(2u+k)+q+16} \Theta^{c-1} \eta_u^e ,    u=0,...,k

for some integer c>3, rather than c=3 in the formulas (7.32) and (7.36.b).
This flexibility may be accomplished by allowing for a modification of the
control parameters in the systolic cells. This is especially applicable if the
cells are controlled by external control signals that propagate systolically in
the network. In this case the form of the input should be changed
accordingly and the separation during pipelined operation should be set to ck.
We will not consider here the equations that describe the operation of the
modified network in any detail.

Remark  3: 

In the case of d>1 degrees of freedom per node (see Remark 1 in
Chapter ?), the input coefficients a^e_{r,l} in ALG5 are dxd matrices and the
entries H^e_{i,j} are dxd submatrices. In order to compute the d^2 elements of
H^e_{i,j} without slowing down the system, we may replace the subnetwork N5
by d^2 identical subnetworks, each of which generates the corresponding
entry in the submatrix H^e_{i,j}, when provided with the appropriate entry in the
dxd matrices a^e_{r,l}, r,l=0,1,2. Similarly, d identical copies of the
subnetwork N6 are needed for the simultaneous computation of the d components
of the subvector b^e_i.


8.  THE  ASSEMBLY  STAGE. 


The elemental arrays H^e and b^e generated by the systolic system of
the last chapter are the main contributors to the global stiffness matrix H
and load vector b. The assembly of the contribution of a specific finite
element e to the global arrays begins with the modification of H^e and b^e,
if necessary, to account for the boundary conditions. The elements of the
modified arrays H^e and b^e, e=1,...,m are then appropriately scaled and
assembled according to the formulas (6.8/10). In order to scale the arrays
of a specific element e, we need to know the global labels of the nodes
(e,i), i=1,...,k belonging to that element. Given the local/global mapping,
the assembly of H^e and b^e, e=1,...,m may be described by the following
algorithm:

ALG6

1)  Initialize the global matrix H and the global vector b to zero.

2)  FOR e=1 TO m DO

2.1)  FOR i=1 TO k DO

2.1.1)  FOR j=1 TO k DO

H(glob(e,i), glob(e,j)) = H(glob(e,i), glob(e,j)) + H^e_{i,j}

2.1.2)  b(glob(e,i)) = b(glob(e,i)) + b^e_i
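
ALG6 translates directly into a few lines of Python. The sketch below is a sequential reference version with 0-based indices and dense storage, meant only to pin down what the parallel assembly stage has to reproduce; the function name and the list-of-lists data layout are assumptions of the sketch.

```python
def assemble(elem_H, elem_b, glob, n):
    """Sequential sketch of ALG6 (0-based indices, dense storage).

    elem_H[e][i][j] and elem_b[e][i] are the elemental arrays of element e;
    glob[e][i] is the global number of local node i of element e;
    n is the number of global nodes.
    """
    H = [[0.0] * n for _ in range(n)]     # step 1: initialize to zero
    b = [0.0] * n
    for e in range(len(elem_H)):          # step 2
        k = len(glob[e])
        for i in range(k):                # step 2.1
            gi = glob[e][i]
            for j in range(k):            # step 2.1.1
                H[gi][glob[e][j]] += elem_H[e][i][j]
            b[gi] += elem_b[e][i]         # step 2.1.2
    return H, b
```

With two one-dimensional elements sharing the middle node, the shared diagonal entry receives one contribution from each element.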

In Sections 1 and 2 of this chapter, we deal with the generation of the
global labels and the modification of H^e and b^e to satisfy the boundary
conditions. Then in Sections 3 and 4, we discuss the parallel
implementation of the assembly stage, and finally in Section 5, we show that the
assembly stage may be eliminated if the resulting system of linear equations
(6.7) is to be solved iteratively.

It should also be mentioned that Law [34] suggested a systolic
architecture for the assembly of the global matrix H. However, the timing and
form of the input data required by his architecture are not compatible with
the output generated by the system discussed in Chapter 7. Moreover, Law
assumes that each cell in the array performs matrix operations, which
of course requires a larger clock cycle than the one for the simple
operations used in the system that generates H^e. For these reasons, we were
not able to use Law's array in our design. This, of course, does not
exclude the possibility of employing his array in other designs of parallel
finite element systems.

8.1.  Generation  of  the  global  labels. 

The purpose of this section is to design a subnetwork that associates
the global labels glob(e,i) and glob(e,j) with each H^e_{i,j} generated from N5,
and adjusts the entries of H^e that correspond to the nodes at which the
solution function should be zero (see Remark 2 in Chapter 6).


Figure  8.1  -  The  graph  for  N7 

In Figure 8.1, we show the graph of a network N7 that performs these
two tasks. Its input links z_{u,q+4}, u=0,...,k-1 are to be connected to the
corresponding outputs of the subnetwork N5 that carry the elements of H^e
as described by equation (7.32). The input links s_{0,q+4} and r_{0,q+4} carry
identical information, namely the global labels of the nodes (e,i), i=1,...,k.
More precisely, we have

    \sigma_{0,q+4} = \rho_{0,q+4} = \Omega^{w+7} P^{3k}_{e=1,m} ( \Theta^{2} \nu^e )        (8.1.a)

where T(\nu^e) = k and \nu^e(t) = glob(e,t). Finally, the input link p_{0,q+4}
carries a single bit for identifying the nodes at which the solution should be
zero, that is, the nodes that lie on \partial\Omega_0. More precisely, we have

    \pi_{0,q+4} = \Omega^{w+7} P^{3k}_{e=1,m} ( \Theta^{2} \gamma^e )        (8.1.b)

where T(\gamma^e) = k and

    \gamma^e(t) = 1    if node (e,t) \in \partial\Omega_0,
                  0    otherwise.

The operation performed by any of the k cells in the network is very
primitive. First, each cell delays the data on the s, r and p links as
follows:

    \sigma_{u+1,q+4} = \Omega^{3} \sigma_{u,q+4} ,    u=0,...,k-1
    \rho_{u+1,q+4}   = \Omega^{6} \rho_{u,q+4} ,      u=0,...,k-1
    \pi_{u+1,q+4}    = \Omega^{6} \pi_{u,q+4} ,       u=0,...,k-1

This ensures that for each cell u the time of arrival of H^e_{t,t+u} on the link
z_{u,q+4} coincides with the time of arrival of glob(e,t), glob(e,t+u) and
\gamma^e(t) on the links r_{u,q+4}, s_{u,q+4} and p_{u,q+4}, respectively, thus allowing
the cell u to modify the elements of H^e appropriately, and to produce the
modified elements on the output link z_{u,q+5}. The cell also produces on
the output link b_{u,q+5} the pair ( glob(e,t), glob(e,t+u)-glob(e,t) ). This
pair specifies the location at which H^e_{t,t+u} is to be accumulated, assuming
that the global matrix H is stored as a band matrix. In terms of sequence
equations, this translates into the formulas

    \zeta_{u,q+5} = \Omega^{6u+w+8} P^{3k}_{e=1,m} ( \Theta^{2} \bar{\eta}_u^e ) ,    u=0,...,k-1        (8.2.a)

    \beta_{u,q+5} = \Omega^{6u+w+8} P^{3k}_{e=1,m} ( \Theta^{2} \chi^e , \Theta^{2} \chi_u^e ) ,    u=0,...,k-1        (8.2.b)

where \bar{\eta}_u^e is as in (7.40) except that the entries of H^e are now
appropriately modified. The sequences \chi^e and \chi_u^e are described by
T(\chi^e) = T(\chi_u^e) = k-u and

    \chi^e(t)   = glob(e,t)
    \chi_u^e(t) = glob(e,t+u) - glob(e,t)

Finally, we note that a cell (k,q+4) may be added to N7 to modify the
elements of b^e and associate with each b_i^e the global label glob(e,i).
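
The role of the pair ( glob(e,t), glob(e,t+u)-glob(e,t) ) can be illustrated with a small Python sketch that tags the entries of one off-diagonal with their (row, offset) location and accumulates them in a band-like store. The dictionary keyed by (row, offset) is a stand-in for the band storage and is an assumption of the sketch, as are the function names and the 0-based indexing.

```python
from collections import defaultdict


def cell_outputs(H_e, glob_e, u):
    """Triples (row, offset, value) emitted for off-diagonal u of one element.

    row = glob(e,t), offset = glob(e,t+u) - glob(e,t), value = H^e_{t,t+u},
    with t running over 0..k-u-1 (0-based).
    """
    k = len(glob_e)
    return [(glob_e[t], glob_e[t + u] - glob_e[t], H_e[t][t + u])
            for t in range(k - u)]


def accumulate_band(entries):
    """Accumulate tagged entries into a store keyed by (row, offset)."""
    band = defaultdict(float)
    for row, offset, value in entries:
        band[(row, offset)] += value
    return dict(band)
```

Note that the offset is negative whenever glob(e,t+u) is smaller than glob(e,t); how such entries are folded into a symmetric band is not specified here and is left open in the sketch.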

8.2.  Essential  boundary  conditions. 

The essential boundary conditions are the ones responsible for the
terms containing line integrals in the variational formulation (6.1/3). It was
shown that in the finite element approximation, the effect of these terms
may be isolated in the form of a matrix S^e and a vector s^e that are to be
added to the elemental arrays H^e and b^e, respectively. However, for a
given problem, most of the arrays S^e and s^e are zero arrays, and if
nonzero, they contain only a few nonzero entries. Consequently, adding special
hardware for the computation of the few nonzero entries of S^e and s^e,
e=1,...,m is not justified from a practical point of view, especially since
the general formula for the generation of these entries is complicated and
contains many coefficients that are set to zero for particular problems.

More appropriately, these few nonzero entries should be computed by
the general purpose machine that controls the entire computation as a part
of the presetting procedure, and then added to the corresponding entries of
H^e and b^e. In order to ensure the continuity of the flow of data, the
addition should take place in an intermediate sub-system residing between
the system that generates H^e and b^e, and the one that assembles the
global arrays H and b. Conceptually, the intermediate sub-system should
contain a memory to store the nonzero entries of S^e and s^e and some
logic to add every nonzero entry to the corresponding entry in H^e or b^e
at the time of its generation from the subnetwork N7. As we did in the
last section, we will only consider the stiffness matrices and suggest a
possible implementation for the addition H^e + S^e. The extension to the
load vector b^e should be obvious and simple.


The systolic nature of the subnetworks N1,...,N7 enables us to
time exactly the data on the output links z_{u,q+5}, u=0,...,k-1 (see
equations (8.2)). Each of these links may be directed into a systolic processor
P_u, 0 \le u \le k-1, that has access to a local memory M_u as shown in Figure
8.2(a). Each processor contains a register 'CURRENT_u' that it sets to
one at time 6u+w+8, and increments every 3k time units. Hence, when an
element H^e_{t,t+u}, 1 \le t \le k-u appears on an input link z_{u,q+5}, the
corresponding register CURRENT_u in P_u should contain the label e of the finite
element being processed.

(a) The general architecture        (b) The content of M_u

Figure 8.2 - The assembly stage

Each local memory M_u, 0 \le u \le k-1 contains an array INDEX (see Figure
8.2(b)) that has one entry for each finite element e=1,...,m. If the matrix
S^e corresponding to the finite element e is zero, then INDEX(e)=0. On the
other hand, if S^e \ne 0, then INDEX(e) contains a pointer to another array
'BT_u' (Boundary Terms for off-diagonal u) where the entries S^e_{t,t+u},
t=1,...,k-u of the u-th off-diagonal of S^e are stored (including zeroes) in
the specified order. This order is the same as the one in which the
elements H^e_{t,t+u}, 1 \le t \le k-u appear on z_{u,q+5}.

Thus, at times 6u+w+8+3(e-1)k, e=1,...,m, after the processor has
changed the value of CURRENT_u, it consults INDEX(CURRENT_u). If it is
zero, then, for the next 3k time units, the data items on the input link
z_{u,q+5} are placed unchanged on the output link z_{u,q+6}. Otherwise, P_u
retrieves from BT_u the elements S^e_{t,t+u}, t=1,...,k-u, one every three time
units, adds each of them to the corresponding H^e_{t,t+u} and puts the result
on z_{u,q+6}.
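
The control logic of a processor P_u can be summarized by the following Python sketch, which ignores the exact timing and treats the stream on z_{u,q+5} as one list per element. The dictionary representation of INDEX and BT_u and the function name are assumptions of the sketch; only the pass-through/add behavior is taken from the description above.

```python
def processor_u(stream, index, bt_u):
    """Functional sketch of P_u for a fixed off-diagonal u.

    stream[e-1] lists H^e_{t,t+u}, t = 1..k-u, in order of arrival;
    index[e] is 0 when S^e is zero, otherwise a key into bt_u;
    bt_u[key] lists S^e_{t,t+u} in the same order (zeroes included).
    """
    out = []
    current = 0                        # plays the role of CURRENT_u
    for diag in stream:
        current += 1                   # advanced once per element
        ptr = index[current]
        if ptr == 0:
            out.append(list(diag))     # S^e is zero: pass data through
        else:
            terms = bt_u[ptr]
            out.append([h + s for h, s in zip(diag, terms)])
    return out
```

Only the elements with a nonzero entry in INDEX cause any memory traffic beyond the single INDEX lookup, which reflects the observation that most S^e are zero.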

An alternative architecture could be obtained by replacing each M_u with
an associative memory that uses the labels 1 \le e \le m as keys to store the
array BT_u, or by using only one global memory M instead of the k
memories M_u, u=0,...,k-1. In all of the above cases, the content of the
memory is computed and preloaded by the host computer. A completely
different approach would be to perform the matrix addition H^e + S^e on a
systolic network that does not have any memory. However, matrix addition
is a communication-bound rather than a compute-bound operation, and in
our case, most of the data in S^e are trivial (zeroes). Consequently, such
a memoryless subnetwork would require much unnecessary data
communication, which we tried to avoid in our design.


8.3.  The  frontal  principle. 

The assembly stage in the finite element analysis is an example of a
simple computation that is irregular, and hence, does not lend itself to a
simple systolic implementation. In Section 8.5, we will show that the
assembly stage may be eliminated from the analysis if an iterative scheme
is used for the solution of the resulting system of equations Hu=b.
However, if a direct solver is to be used, then the global arrays H and b must
be assembled in order to proceed with the direct solution. In what follows,
we will only consider the assembly of the stiffness matrix H. Although we
will not discuss the assembly of the load vector b, we note that it may be
included here by considering b as an additional column of H.

At first glance, it would appear that the solution of Hu=b may not start
before the assembly of H is completed. Especially in a parallel system,
this would have two disadvantages: firstly, it does not allow the computation
of the different components of the system to proceed in parallel, and
secondly, it requires some intermediate storage to store H. Since, in
practice, finite element problems with n in the order of several thousands are
not uncommon, it is obvious that auxiliary storage (disks) may be needed
for H, thus slowing down the system even more.

Fortunately, we do not have to wait until H is completely assembled
and we may start the solution process as soon as some rows of H are
assembled. The assembly and the solution processes may then proceed in
parallel in a producer/consumer type of interaction: the assembler process
being the producer of the assembled rows of H, and the solution process
being the consumer of these rows.

In order to explain this assembly technique, we denote by h^e_j the j-th
row of the elemental matrix H^e and by h_i the i-th row of the global matrix
H. We also note that a node with a global number i may be shared by more
than one finite element, and hence, may have more than one local label.

Each row h_i, 1 \le i \le n, of the matrix H corresponds to a certain node in
the finite element mesh, namely node i. This row is assembled by
accumulating contributions from the rows h^{e_1}_{j_1},...,h^{e_r}_{j_r}, where e_1,...,e_r are
the elements that share node i, and (e_1,j_1),...,(e_r,j_r) are the
corresponding local labels of node i.

In accordance with the system of Chapter 7, we will assume from now
on that during the assembly process the elemental matrices are processed
in the order H^1,...,H^m, and the rows within each elemental matrix are
considered in the order h^e_1,...,h^e_k. The elements in each row h^e_j are
accumulated in the proper position of the global matrix H (see ALG6), or
more precisely in row h_i of H, where i=glob(e,j).

Before going further, we introduce some terminology. During the
assembly process, a row h_i is said to be active from the moment of the
appearance of a row h^e_j with glob(e,j)=i, that is, from the time when its
assembly actually starts. On the other hand, h_i is called a ready row
immediately after the accumulation of the last row h^r_j with glob(r,j)=i, that
is, after its assembly has been completed. In other words, a certain row of
H is partially accumulated in the period between the instant it becomes
active and the instant it becomes ready. Once it has become ready, a
row may be processed by the solver.
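
As an illustration of this terminology, the following Python sketch determines, for every global row, the first and the last element that contribute to it when the elements are processed in their label order. It works at the granularity of whole elements rather than individual rows h^e_j, which is a simplification; the function name and data layout are hypothetical.

```python
def row_activity(glob):
    """Return {i: (first, last)} for every global node i.

    glob[e] lists the global node numbers of element e, elements being
    processed in the order e = 0, 1, ...; row i becomes active while
    element 'first' is processed and ready once element 'last' is done.
    """
    span = {}
    for e, nodes in enumerate(glob):
        for node in nodes:
            first, _ = span.get(node, (e, e))
            span[node] = (first, e)
    return span
```

A row whose first and last contributing elements coincide is active only while that single element is being assembled.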

These  ideas  were  first  formulated  in  the  framework  of  the  so  called 
frontal  technique  [24]  used  in  sequential  finite  element  systems.  The  goal 
is  to  interleave  the  assembly  and  the  solution  phases  in  order  to  minimize 
the  high-speed  storage  requirements.  The  order  in  which  the  rows  of  H 


become  ready  is  usually  determined  by  a  preprocessing  step. 

The  same  basic  idea  will  be  used  in  our  system  to  achieve  our  goal 
of  allowing  the  assembly  and  the  solution  processes  to  be  executed  in 

parallel.  More  specifically,  whenever  a  row  in  H  becomes  ready,  it  will  be 
passed  to  the  solution  process.  This  also  allows  for  the  reduction  of 
storage  for  the  assembly  since  the  storage  allocated  for  a  row  may  be 

released  whenever  this  row  is  passed  to  the  solution  process  (consumed). 

However, in all the known parallel schemes for the direct solution of
Hu=b, the rows of H have to be processed in a sequential order, and
hence, the size of the storage required by the assembly stage is
determined by the maximum number of rows that are at any one time either
active or ready but not yet passed to the solver (consumed) because a
preceding row is not yet ready. For this reason, the global labels given to
the nodes in the finite element mesh should be such that the rows of H
become ready in an almost sequential order. By this we mean that there
should exist a relatively small constant c such that for any i and j
satisfying 1 \le i,j \le n and i+c < j, row j does not become ready before row i. With
this restriction, we can restore the sequential order by using a buffer large
enough to store up to c rows of H.

If a band storage mode is used for storing the rows of the banded
matrix H, then it is also advantageous to minimize the bandwidth of H. In
this connection, it has to be noted that the bandwidth of H is determined
by the global numbering of the nodes in the corresponding finite element
mesh. In fact, given a global numbering and denoting by 2B+1 the
bandwidth of the matrix H resulting from this numbering, it is easy to show
that

    B = max { l^e_max - l^e_min ; e=1,...,m }        (8.3)

where l^e_max and l^e_min are the largest and smallest global node numbers in the
finite element e, respectively.
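
Formula (8.3) is straightforward to evaluate from the local/global mapping. A minimal Python sketch, with a hypothetical function name and with glob given as one list of global node numbers per element:

```python
def half_bandwidth(glob):
    """B of equation (8.3): the largest spread of global node numbers
    within any single element; the bandwidth of H is then 2B + 1."""
    return max(max(nodes) - min(nodes) for nodes in glob)
```

For a chain of two-node elements numbered consecutively this gives B=1, that is, a tridiagonal matrix.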

Many  heuristic  node  numbering  algorithms  were  suggested  for  reducing 
the  bandwidth  (e.g.  [13]  ).  However,  if  the  assembly  and  the  solution 
processes  are  to  be  executed  in  parallel,  then  we  need  a  numbering 
scheme  that,  in  addition  to  reducing  the  bandwidth,  has  the  goal  of  reduc¬ 
ing  the  number  of  active  rows  of  H  at  a  given  time,  and  of  producing  the 
assembled  rows  of  H  in  an  almost  sequential  order.  The  following  algo¬ 
rithm  takes  these  goals  into  consideration. 

ALG7

1)  Assign a unique label e, 1 \le e \le m, to each of the m finite elements.

2)  Number the nodes sequentially in the following order:

2.1)  Number, in arbitrary order, the nodes of element 1.

2.2)  FOR e=2,...,m DO

2.2.1)  Number, in arbitrary order, the nodes in element e that
are not yet numbered.
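
Step 2 of ALG7 can be sketched in Python as follows. The element labeling of step 1 is taken as given by the order of the input list, and "arbitrary order" is resolved here by taking the nodes in the order in which each element lists them; both choices, as well as the function name, are assumptions of the sketch.

```python
def number_nodes(elements):
    """Sketch of step 2 of ALG7.

    elements[e-1] lists the node identifiers of the element labeled e.
    Returns a dict mapping each node identifier to its global number (1-based).
    """
    number = {}
    for nodes in elements:                    # elements in label order 1..m
        for node in nodes:                    # 'arbitrary order': as listed
            if node not in number:            # skip nodes already numbered
                number[node] = len(number) + 1
    return number
```

Nodes shared with an earlier element keep the number they received there, so each element only consumes new numbers for its not yet numbered nodes.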

ALG7 does not specify the method of labeling the elements e=1,...,m.
If the element labels satisfy the property that for any e, 1 < e \le m, element e
contains at least one node that does not belong to any element 1,...,e-1,
then we call such a labeling scheme a proper element labeling. For
example, the element labeling in Figure 8.3(a) does satisfy the above
condition, and hence is a proper labeling. On the other hand, the labeling for
the same mesh in Figure 8.3(b) is not proper because element 10 does not
contain any node which is not in the elements 1,...,9. With this
definition, we can prove the following proposition:

Proposition 8.1: If the nodes of the finite element mesh are numbered using
ALG7 and the labeling in step 1 is a proper element labeling, then for any
e, 1 \le e \le m, the elements e,...,m do not contain a node with a global
number smaller than l^e_max - B, where l^e_max is the largest node number in
element e, and 2B+1 is the bandwidth of the matrix resulting from this
particular node numbering.

(a) A proper labeling        (b) A non proper labeling

Figure 8.3 - Elements labeling.

Proof: ALG7 and the proper element labeling imply that for any r>e, the
finite element r should contain a node with a label l^r_max > l^e_max. However, from
equation (8.3) we find that l^r_max - l^r_min \le B, or l^r_min \ge l^r_max - B > l^e_max - B.
Hence, because the label of any node in element r is at least l^r_min, we
conclude that the elements e+1,...,m do not contain a node with a global label
smaller than or equal to l^e_max - B. Now, the result of the proposition follows
from the fact that any node in element e has a label larger than or equal
to l^e_min \ge l^e_max - B.  ■
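
Both the definition of a proper element labeling and the statement of Proposition 8.1 can be checked mechanically on a given mesh. The Python sketch below does so; the function names are hypothetical and glob[e] lists the global node numbers of the element with label e+1.

```python
def is_proper_labeling(elements):
    """True if every element after the first contains at least one node
    that does not belong to any element with a smaller label."""
    seen = set()
    for e, nodes in enumerate(elements):
        if e > 0 and not (set(nodes) - seen):
            return False
        seen |= set(nodes)
    return True


def proposition_holds(glob):
    """Check the conclusion of Proposition 8.1 for a given numbering:
    for every e, no element e, e+1, ..., m contains a node numbered
    below (largest node number of e) - B, with B taken from (8.3)."""
    B = max(max(nodes) - min(nodes) for nodes in glob)
    for e in range(len(glob)):
        bound = max(glob[e]) - B
        if any(min(nodes) < bound for nodes in glob[e:]):
            return False
    return True
```

A numbering that was not produced by ALG7, such as `[[3, 4], [1, 2]]`, may violate the conclusion, which shows that the hypotheses of the proposition are actually used.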

In our system, we will assume that the global node numbering scheme
satisfies the conditions stated in Proposition 8.1, and that the finite elements
are processed in the order 1,...,m. Then, the proposition guarantees that
during the assembly of the contributions of an elemental matrix H^e into the
global matrix H, the rows 1,...,l^e_max-B-1 of H are completely assembled and
will not be modified by any contribution from the elemental arrays H^r, r>e.
In other words, if row i, 1 \le i \le n, in H is active, then the rows up to i -
(B+1) are ready and may be processed by the solver and thus the storage
associated with them may be released.

Definition: If, at a specific time during the assembly of H, a certain row i
is active, then the rows 1,...,i-(B+1) are called B_ready rows of H.

From Proposition 8.1, it follows that B_ready rows are ready rows of H.
However, a given row may be ready before it becomes B_ready. Being
pessimistic, we will follow the rule of not allowing the solution process to
access a certain row in H before that row becomes B_ready, except of
course for the last B+1 rows that may be accessed only after the
assembly of H is completed. This may decrease somewhat the efficiency,
but it has the advantage of eliminating the preprocessing stage that would
otherwise be required for determining the time at which a certain row in H
becomes ready.

If the above rule is used as the basis for the interaction between the
assembler and the solution processes, then the assembler process should
contain storage for holding at least B+1 active rows of H. Moreover, as in
the case of any producer/consumer problem, additional buffers may be
required depending on the relative speeds of production of the B_ready
rows and their consumption.

It is not hard to see that, due to the nature of the assembly process,
the rate at which the rows of H become B_ready is not constant. Hence
we will consider the average of that rate. This average rate differs from
one problem to the other. In order to be more specific, we first note that
the inputs to the assembly stage are the mk rows h_j^e, j=1,...,k,
e=1,...,m, and that the outputs from that stage are the n rows h_i,
i=1,...,n. We also assume that the execution time of the assembly process
is T_a time units and that its data bandwidths are sufficient to transmit
the input of one row h_j^e as well as the output of one B_ready row at a
time. With this, we suppose that the rate of arrival of the rows at the
assembly stage is constant, and we denote this rate by r_i = mk/T_a
rows/time unit. We also define the average rate at which the assembly
stage produces the B_ready rows of H as r_p = n/T_a = (n/(mk)) r_i. Since
the ratio n/(mk) changes from one problem to another, it should be clear
that, for a fixed input rate, the average rate at which the rows of H
become B_ready does depend on the problem at hand.

Hence, in general, we cannot achieve the desired match between the
average rate of production of the B_ready rows and the rate at which the
solution process is ready to consume these rows. More specifically, if the
solution process is capable of processing (consuming) r_c rows of H every
time unit, then we have one of the following possibilities:

1) r_c < r_p, that is, the solver cannot consume the B_ready rows of H
fast enough. In this case, the number of B_ready but unconsumed
rows grows continuously and so does the size of the memory
requirement of the assembler process.

2) r_c ≥ r_p, in which case the assembly stage needs only to provide
storage for B+1+K_b rows of H, where K_b is the size of the storage
needed to buffer the fluctuations in the production rate of the
B_ready rows of H.
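
The two cases can be illustrated with a small discrete-time sketch. It is
not part of the original design: the function name, the per-step production
lists and the integer consumption rate are assumptions chosen only to show
how the number of B_ready but unconsumed rows evolves.

```python
def backlog_history(produced_per_step, r_c):
    """Track the number of B_ready but unconsumed rows over time.

    produced_per_step: rows that become B_ready in each time unit
                       (hypothetical input data)
    r_c: rows the solver can consume per time unit (assumed constant)
    """
    backlog, history = 0, []
    for produced in produced_per_step:
        backlog += produced            # rows that became B_ready this step
        backlog -= min(backlog, r_c)   # the solver consumes at most r_c rows
        history.append(backlog)
    return history


# Case 1: production averages 2 rows/step but r_c = 1, so the backlog grows.
growing = backlog_history([2] * 6, r_c=1)

# Case 2: production averages 1 row/step (equal to r_c) but arrives in bursts,
# so a bounded extra buffer (the role played by K_b) absorbs the fluctuation.
bounded = backlog_history([3, 0, 0, 3, 0, 0], r_c=1)
```

In the second case the largest backlog is the extra buffer size that this
particular production pattern would need.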

From the above discussion, it is clear that to limit the size of the
storage required in the assembly process, we need to guarantee that
r_c ≥ r_p. This, however, will force the solution process to execute at a
speed slower than its nominal speed r_c, because often it will be forced to
stop execution and wait for the B_ready rows of H to become available. As
a result, the synchronization between the solution and the assembly
processes can no longer be controlled by a global clock. Instead, a
handshake protocol should be used for that purpose (see Section 5.5).

Assuming that r_c ≥ r_p, we may control the size of the additional buffer
K_b by controlling the fluctuation in the rate r_p(t) of production of the
B_ready rows. More specifically, if we can guarantee that, at any time t
during the assembly process, the rate r_p(t) of production of the B_ready
rows is smaller than or equal to the consumption rate r_c, then any row of
H will be consumed as soon as it becomes B_ready, thus reducing the
size of the additional buffer K_b to zero.

Here, we note that the rate r_c is uniquely specified by the solution
process, and hence that the relation between r_c and r_p(t) cannot be
obtained by studying solely the assembly process. However, by specifying in
ALG7 a certain order for numbering the nodes within each element, rather
than keeping this order arbitrary, we can prove that r_p(t) may not, at any
time t, exceed the constant rate r_i of arrival of the rows at the assembly
stage. This result will be used in Section 9.1. We start by modifying
ALG7 to obtain the following node numbering algorithm.

ALG8

Given a proper element labeling for the finite elements and a
corresponding local numbering of the nodes, obtain the global numbering
by giving the nodes sequential numbers i=1,...,n in the following
order:

1) FOR j=1,...,k DO
      glob(1,j)=j

2) FOR e=2,...,m DO
      FOR j=1,...,k DO
         IF node (e,j) is already numbered THEN skip
         ELSE increase i by one and set glob(e,j)=i.
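
The numbering rule of ALG8 can be sketched in a few lines of Python. The
representation is an assumption made for illustration: each element is given
as a list of arbitrary node identifiers in local order, and the counter i
starts at k after the first element, which the loop below obtains by treating
the first element like all the others.

```python
def alg8_global_numbering(elements):
    """Number nodes in order of first appearance (ALG8-style sketch).

    elements[e-1] lists the node identifiers of element e in local order.
    Returns (glob, label) where glob[(e, j)] is the global number of local
    node j of element e (both 1-based) and label maps identifiers to numbers.
    """
    label = {}   # node identifier -> global number
    glob = {}
    i = 0        # last global number handed out
    for e, nodes in enumerate(elements, start=1):
        for j, node in enumerate(nodes, start=1):
            if node not in label:      # node not yet numbered
                i += 1
                label[node] = i
            glob[(e, j)] = label[node]
    return glob, label


# Two 4-node elements sharing the edge B-C (hypothetical mesh).
glob, label = alg8_global_numbering([["A", "B", "C", "D"],
                                     ["B", "E", "F", "C"]])
```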

Now, we can prove the following result:

Proposition 8.2: Let the nodes in the finite element mesh be numbered by
algorithm ALG8 and let the elemental arrays be accumulated in the global
array in the order H^1,...,H^m, with the rows within each H^e being
accumulated in the order h_1^e,...,h_k^e. Then, the rows of H become
active in a purely sequential order.

Proof: Consider any two rows h_i1 and h_i2 of H with i1<i2, and let (e1,j1)
be the local label for node i1 such that glob(e1,j1)=i1, and that we have
e1'>e1 for any other local label (e1',j1') with glob(e1',j1')=i1. In other
words, element e1 is the first element containing node i1. Similarly, let
(e2,j2) be the local label of i2 with respect to the first element
containing it. From the definition of active rows, we know that the rows
h_i1 and h_i2 of H become active when rows h_j1^e1 and h_j2^e2 are received
by the assembler, respectively. However, ALG8 and the fact that i1<i2
together imply that either e1<e2, or e1=e2 and j1<j2. In both cases, we
conclude from the hypothesis of the proposition that h_j1^e1 is received by
the assembler before h_j2^e2, and hence that row i1 becomes active before
row i2. ■

Proposition 8.3: Assume that the hypotheses of Proposition 8.2 apply, and,
moreover, that the rate at which the rows h_j^e, j=1,...,k, e=1,...,m,
arrive at the assembly stage is constant. Then the rate of production of the
B_ready rows r_p(t) at any time t cannot exceed r_i rows/time unit.


Proof: Note here that the arrival of a certain row h_j^e, 1≤j≤k, 1≤e≤m, at
the assembler may at most activate one row of H, namely the row labeled
i=glob(e,j). Hence, after the arrival of h_j^e, the rows h_1,...,h_{i-(B+1)}
are, by definition, B_ready rows. However, from Proposition 8.2, we know
that row h_{i-1} should already be active at the instant when row h_i
becomes active. That is, before the arrival of h_j^e, rows
h_1,...,h_{i-(B+1)-1} were B_ready. In other words, the arrival of h_j^e
may create at most one B_ready row, namely row h_{i-(B+1)}. ■
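
The statements of Propositions 8.2 and 8.3 can be checked numerically on a
small example. The sketch below is only an illustration under stated
assumptions: the mesh (a strip of three 4-node elements numbered in
first-appearance order), the value of B and the function name are all
hypothetical.

```python
def b_ready_increments(glob, B):
    """Stream elemental rows in order and record B_ready growth.

    glob[e][j] is the global label (starting at 1) of local node j of
    element e; elements and rows arrive in list order.
    Returns (activation_order, increments), where increments[t] is the
    number of rows that become B_ready when the t-th row arrives.
    """
    active = set()
    max_active = 0
    activation_order, increments = [], []
    for element in glob:
        for i in element:
            before = max(max_active - (B + 1), 0)   # B_ready rows so far
            if i not in active:                     # first contribution to row i
                active.add(i)
                activation_order.append(i)
                max_active = max(max_active, i)
            after = max(max_active - (B + 1), 0)
            increments.append(after - before)
    return activation_order, increments


# Strip of three quadrilaterals; the largest label spread in an element is 4.
order, incs = b_ready_increments([[1, 2, 3, 4], [2, 5, 6, 3], [5, 7, 8, 6]], B=4)
```

For this mesh the rows become active in the order 1,...,8 and no arriving
row creates more than one B_ready row.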

We now return to algorithms ALG7 and ALG8. Although these algorithms
provide good numbering schemes from the point of view of processing the
assembly and solution processes in parallel, we still have to ensure
that they do not result in a large value of B. For this we note that ALG7
and ALG8 are two-step algorithms. First, the finite elements e=1,...,m are
labeled, and then the nodes within the elements are numbered. To our
knowledge, Fenves and Law [15] were the first to suggest a two-step
numbering algorithm. They reported experimental results which show that if
the Cuthill-McKee [13] algorithm is used to number the finite elements with
a two-step algorithm, then the bandwidth of the resulting matrix H is
comparable with the one resulting from the best known heuristic algorithm
for minimizing the bandwidth. Here, in the application of the Cuthill-McKee
algorithm for labeling the elements, Fenves and Law consider two elements
to be neighbors if they share a common boundary.

Clearly, the hypothesis in Proposition 8.1, requiring the element labeling
scheme in ALG7 to be proper, is essential. In order to see this, consider
the simple example of Figure 8.4(a) where the element labeling is not
proper (element 3 does not contain a node not in elements 1 or 2).


Figure  8.4  -  Application  of  ALG7  for  node  numbering. 


According to the results in [15], we may obtain a relatively small
bandwidth with a two-step node numbering algorithm if we use the
Cuthill-McKee scheme to number the elements. However, we note that this may
result in a non-proper element labeling scheme. For example, the elements
in Figure 8.3(b) were labeled using the Cuthill-McKee algorithm but
the resulting labeling is not proper (because of element 10). The
corresponding node numbering is shown in Figure 8.4(b), where the largest
node number in element 9 is l^9=20, but element 10 contains a node with a
number smaller than l^9-B=12, namely node 9. In this case, however, a
proper element labeling may be obtained by starting the Cuthill-McKee
algorithm from a different element (see e.g. Figure 8.3(a)). We examined
many strange shapes of meshes and in only rare cases did the application of
the Cuthill-McKee algorithm result in a non-proper labeling. Moreover, in
all these rare cases a proper labeling was easily obtained by changing the
starting element. The existence and construction of a proper element
labeling scheme for a given finite element mesh is still a question that
needs to be answered.

Finally, we note that, by allowing the solution process to access the
rows of H only when they become B_ready, we do not increase the storage
requirement of the assembly process. As an implication of Propositions 8.1
and 8.3, this storage should be large enough to hold B+1 rows of H.
However, this is the minimum amount of storage that should be provided by
the assembler even if the rows of H were accessed as soon as they
become ready. In fact, there always exists an element e in which the
largest and smallest node labels differ by B, and hence the assembly of
this element does require the storage of B+1 rows of H, assuming, of
course, that the rows of H are stored in consecutive locations.

8.4.  A  parallel  implementation  of  the  assembly  process. 

The  discussion  of  the  last  section  showed  that  the  potential  parallelism 
between  the  assembly  and  the  solution  processes  may  be  paralyzed  if  the 
solution  process  is  designed  to  access  the  rows  of  the  assembled  matrix  H 
in  order  and  the  node  numbering  scheme  does  not  take  this  into  con¬ 
sideration.  Here,  we  prevent  this  from  happening  by  assuming  that  the 
node  numbers  satisfy  the  conditions  of  Propositions  8.1  and  8.3. 

Next, we modify the definitions of h_i^e and h_i such that
h_i^e = {H^e_{i,j}; j=i,...,k} and h_i = {H_{i,j}; j=i,...,i+B} are the
sets of elements of interest in the i-th rows of the symmetric matrices
H^e and H, respectively. In addition, we denote by
L_i^e = {(glob(e,i),glob(e,j)); j=i,...,k} the set that contains the
information about the position at which each element of h_i^e is to be
assembled into the global matrix H. With this, we may describe the
assembler process as follows:
PROCESS 'ASSEMBLER'

Max_ready := 0 ; Consumed := 0 ; H() := 0 ;

Interrupt I: /* High priority */

1) Get h_i^e and L_i^e from I_port ;

2) Accumulate the elements of h_i^e in H() ;

3) IF (Max_ready < glob(e,i)-B-1) THEN Max_ready := glob(e,i)-B-1 ;

4) EXIT from the interrupt.

Interrupt O: /* Low priority */

1) WAIT until (Max_ready > Consumed) OR (D_flag is set) ;

2) Send h_c, c := Consumed, to O_port ;

3) IF (Consumed = n) THEN STOP the ASSEMBLER process ;

4) Consumed := Consumed + 1 ;

5) EXIT from the interrupt.

Here, the following notes are in order:

a) The interrupt I takes place when a new input arrives at I_port. The
data bandwidth of I_port is assumed to be large enough to input the sets
h_i^e and L_i^e, 1≤e≤m, 1≤i≤k.

b) The interrupt O has a lower priority than the interrupt I and it takes
place when O_port is ready to receive the next output. The data bandwidth
of O_port is assumed to be large enough to output a set h_i, 1≤i≤n.

c) The flag D_flag is set externally when the input of all the elemental
matrices is completed.
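
The interplay of the two interrupts can be sketched as a sequential Python
simulation. This is a simplified illustration and not the process itself:
interrupts are modeled as ordinary method calls, a dictionary stands in for
the storage H(), each entry is stored in the upper triangle by symmetry, and
the counter `consumed` is taken to be the number of rows already sent, so
that the next row sent is `consumed + 1`. All of these are assumptions.

```python
class Assembler:
    """Sequential sketch of the ASSEMBLER process (hypothetical model)."""

    def __init__(self, n, B):
        self.n, self.B = n, B
        self.max_ready = 0      # highest row index known to be B_ready
        self.consumed = 0       # number of rows already sent to the solver
        self.d_flag = False     # set externally once all elements arrived
        self.H = {}             # (row, column) -> value, upper triangle only

    def interrupt_I(self, row_entries, positions):
        """Accumulate one elemental row.

        positions[t] = (glob(e,i), glob(e,j)) for the t-th entry; the first
        pair belongs to the diagonal entry of the elemental row.
        """
        for value, (gi, gj) in zip(row_entries, positions):
            key = (min(gi, gj), max(gi, gj))       # H is symmetric
            self.H[key] = self.H.get(key, 0.0) + value
        gi = positions[0][0]                       # global label of this row
        self.max_ready = max(self.max_ready, gi - self.B - 1)

    def interrupt_O(self):
        """Return the next completed row as {column: value}, or None if the
        solver has to wait (or everything has been sent)."""
        if self.consumed >= self.n:
            return None
        if not (self.max_ready > self.consumed or self.d_flag):
            return None
        c = self.consumed + 1
        cols = range(c, min(c + self.B, self.n) + 1)
        row = {j: self.H.pop((c, j)) for j in cols if (c, j) in self.H}
        self.consumed = c
        return row
```

Because a row is removed from the dictionary as soon as it is sent, the
storage in use behaves like the bounded buffer discussed in Section 8.3.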

It is straightforward to verify, on the basis of Proposition 8.1, that the
above process will output the rows of H to O_port only when they are
completely assembled. If, moreover, the solution process is able to
consume the rows of H at a rate faster than they can be provided at
O_port, then we do not need to provide storage for the n rows of H, and
it will be enough to provide a circular buffer to store B+1+K_b rows of H,
where the need for the additional buffer K_b was discussed in Section 8.3.

The  process  'ASSEMBLER'  is  assumed  to  handle  a  large  amount  of 
data  at  a  high  rate  due  to  the  large  input  and  output  data  bandwidths. 
Consequently,  we  may  need  more  than  one  processor  to  execute  this  pro¬ 
cess.  Fortunately,  the  'ASSEMBLER'  may  be  easily  decomposed  into  a 
number  of  parallel  sub-processes,  each  responsible  for  managing  the  data 
on  one  or  more  off-diagonals  of  the  matrix  H. 


Figure 8.5 - A parallel architecture for the assembler

Consider, for example, the extreme case where we have B+1 sub-processes
'ASSEMBLER_w', w=0,...,B running on B+1 processor/memory units (see
Figure 8.5). Here 'ASSEMBLER_w' manages the assembly of the w-th
off-diagonal of H. The communication network in Figure 8.5 distributes
the elements of h_i^e such that each element H^e_{i,j} is sent to the P/M
unit responsible for its accumulation. More precisely, the communication
network receives with each element H^e_{i,j} the global/local map
information v=glob(e,i) and w=ABS(glob(e,i)-glob(e,j)) (ABS = absolute
value), which specifies that H^e_{i,j} is to be accumulated in the w-th
off-diagonal position of row v of H. With the value of w, the network then
sends both H^e_{i,j} and v to P/M_w, which completes the accumulation.

One difficulty arises from this decomposition of 'ASSEMBLER', namely
that upon receipt of a certain row h_i^e, the entries of this row will be
distributed on the few P/M's that are responsible for their accumulation.
Hence, only these few P/M's will detect the arrival of h_i^e and update
accordingly their copy of the variable 'Max_ready'. The copies of
Max_ready in the other P/M's will not be updated unless we provide for some
sort of inter-process communication. Since P/M_0 receives the diagonal
element of every row h_i^e arriving at the assembler, it seems natural to
have only P/M_0 update its value of Max_ready, and then send a message to
the other P/M's with the new value of Max_ready, whenever it is updated.
This message may be broadcast to the other processors or passed from one
processor to the next (a daisy chain). With this observation, we may
describe the process in any processor/memory unit P/M_w as follows:

Sub-process ASSEMBLER_w :

Max_ready := 0 ; Consumed := 0 ; diag_w() := 0 ;

Interrupt I: /* High priority */

1) Get H^e_{i,j} and v=glob(e,i) from I_port_w ;

2) diag_w(v) := diag_w(v) + H^e_{i,j} ;

3) FOR P/M_0 ONLY : IF (Max_ready < v-B-1) THEN

   3.1) Max_ready := v-B-1 ;

   3.2) Send messages to the other P/M's with the new value of
        Max_ready ;

4) EXIT from the interrupt ;

Interrupt O: /* Low priority */

1) WAIT until (Max_ready > Consumed) OR (D_flag is set) ;

2) Send diag_w(Consumed) to O_port_w ;

3) IF (Consumed = n) THEN STOP the ASSEMBLER_w process ;

4) Consumed := Consumed + 1 ;

5) EXIT from the interrupt ;

Interrupt M: /* High priority */

Update the local copy of Max_ready as received from P/M_0.

Here the high priority interrupt M takes place only in P/M_1,...,P/M_B when
a message is received from P/M_0. Similar to the process ASSEMBLER, the
linear array diag_w() may be replaced by a circular buffer of length
B+1+K_b. Note that the message-passing communication technique may be
replaced by a global shared variable, or by letting every P/M receive an
indication that a row h_i^e has been received.

Finally, we note that the communication network of Figure 8.5 may be
implemented as a binary tree network [22] where the nodes at consecutive
levels use the corresponding bits of the address w to send H^e_{i,j} and v
to either its left or right successor node, with I_port_w, w=0,...,B placed
at the leaves of the tree. We do not intend to discuss here the
communication network in any detail.
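
As a rough illustration of the bit-steered routing mentioned above, the
following sketch lists the left/right decisions that lead to leaf w. The
conventions are assumptions and are not taken from the text: the most
significant bit is used at the root, a 0 bit means "left", and the tree
depth is the number of bits needed to address the leaves 0,...,B.

```python
def tree_path(w, B):
    """Return the left/right decisions ('L'/'R') that route a datum to
    leaf w of a binary tree with at least B+1 leaves."""
    if not 0 <= w <= B:
        raise ValueError("w must lie in 0..B")
    depth = max(1, B.bit_length())          # bits needed to address 0..B
    bits = format(w, "0{}b".format(depth))  # zero-padded binary address
    return ["R" if bit == "1" else "L" for bit in bits]


# With B = 4 there are five leaves and the tree has three levels.
path = tree_path(3, 4)
```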

8.5.  Elimination  of  the  assembly  stage. 

To our knowledge, Belytschko et al. [3] were the first to notice that
the multiplication of the global matrix H by a vector p can be completed
without assembling H. More precisely, from equation (6.8) we have

   H p = sum_{e=1}^{m} (M^e)^T H^e M^e p

       = sum_{e=1}^{m} (M^e)^T H^e p^e

where p^e = M^e p is a k-dimensional vector containing those entries in p
that correspond to the nodes belonging to the finite element e, that is,
p^e(j) = p(glob(e,j)), for j=1,...,k.

The following algorithm describes precisely the use of the unassembled
matrices H^e, e=1,...,m in the computation of the product vector
y = x + H p, where x and p are given.

ALG9

1) FOR i=1,...,n DO
   1.1) y(i) = x(i)

2) FOR e=1,...,m DO
   2.1) Form the vector p^e from
           p^e(j) = p(glob(e,j))                  j=1,...,k
   2.2) Obtain the k-dimensional vector product
           y^e = H^e p^e
   2.3) Accumulate y^e into y according to
           y(glob(e,j)) = y(glob(e,j)) + y^e(j)   j=1,...,k
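
A direct transcription of ALG9 into Python may make the three steps concrete.
The data layout is an assumption for illustration only: elemental matrices
are plain lists of lists, and the map glob is a list of 1-based global labels
per element.

```python
def alg9(x, p, H_elems, glob):
    """Compute y = x + H p from unassembled elemental matrices.

    H_elems[e] is the k-by-k elemental matrix of element e (list of lists);
    glob[e][j] is the global label (starting at 1) of local node j.
    """
    y = list(x)                                       # step 1: y = x
    for H_e, labels in zip(H_elems, glob):            # step 2: loop on elements
        p_e = [p[g - 1] for g in labels]              # 2.1: gather p^e
        y_e = [sum(a * b for a, b in zip(row, p_e))   # 2.2: y^e = H^e p^e
               for row in H_e]
        for g, value in zip(labels, y_e):             # 2.3: scatter into y
            y[g - 1] += value
    return y


# Three 2-node elements in a row (hypothetical 1-D mesh with 4 nodes).
H_e = [[1.0, -1.0], [-1.0, 1.0]]
y = alg9([0.0] * 4, [1.0, 2.0, 4.0, 8.0], [H_e] * 3, [[1, 2], [2, 3], [3, 4]])
```

For this example the assembled matrix would be the tridiagonal matrix with
diagonal (1, 2, 2, 1) and off-diagonals -1, and the result agrees with the
direct product.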

The above algorithm is very useful if an iterative solver is to be used
for solving the linear system of equations H u = b resulting from the finite
element approximation. In fact, as noted before, most iterative solvers
involve the matrix H only for the computation of its product with some given
vectors and this can be done without assembling the global matrix H.
Clearly, ALG9 is especially suitable for our systolic system where the
generation of the matrices H^e is pipelined for e=1,...,m, and hence allows
also for pipelining the formation of the partial products y^e.

It is widely accepted in parallel processing that one may replicate some
parts of the computation or apply an algorithm that may not be efficient for
sequential processing, provided that the gain obtained from parallelism
justifies the added cost. This may be the case in the computation of the
product vector y = Hp, where the direct computation does require less work
than the application of ALG9. Also it seems obvious that the amount of
storage required to store the matrix H is less than that required for storing
all of the elemental matrices H^e, e=1,...,m. In order to justify the
application of ALG9, we will compare the storage requirement for storing H
with that of storing all the H^e, e=1,...,m, and the work required to
compute Hp directly rather than by ALG9.


Figure 8.6 - A uniform finite element mesh.

Figure 8.7 - Some quadrilateral element types: a) k=4, b) k=8, c) k=9,
d) k=16.


As a basis for the comparison, we consider the matrix H corresponding
to a uniform finite element mesh over a rectangular domain (see Figure
8.6). The number of elements in the horizontal and vertical directions is
assumed to be m_h and m_v, respectively. Hence the total number of
elements is m = m_h m_v. We shall consider four different types of
quadrilateral elements (see Figure 8.7), namely: a) four node bilinear
elements, b) eight node serendipity elements, c) nine node Lagrangian
elements, and d) sixteen node Lagrangian elements. It is easy to check that
the total number of nodes n in the finite element mesh for the four types is
a) (m_h+1)(m_v+1), b) (2m_h+1)(2m_v+1)-m_h m_v, c) (2m_h+1)(2m_v+1) and
d) (3m_h+1)(3m_v+1), respectively.
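
The four node-count formulas can be checked quickly with the sketch below.
The function name and the use of the letters a to d as keys are assumptions
made for illustration.

```python
def node_count(element_type, m_h, m_v):
    """Total number of nodes n in a uniform m_h-by-m_v mesh of
    quadrilateral elements of the given type ('a', 'b', 'c' or 'd')."""
    formulas = {
        "a": (m_h + 1) * (m_v + 1),                      # 4-node bilinear
        "b": (2 * m_h + 1) * (2 * m_v + 1) - m_h * m_v,  # 8-node serendipity
        "c": (2 * m_h + 1) * (2 * m_v + 1),              # 9-node Lagrangian
        "d": (3 * m_h + 1) * (3 * m_v + 1),              # 16-node Lagrangian
    }
    return formulas[element_type]


# A single element must reproduce the number of nodes per element k.
single = [node_count(t, 1, 1) for t in "abcd"]
```

For large m_h and m_v the ratio n/m approaches 1, 3, 4 and 9 for the four
types.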

It should be noted here that H is a banded sparse matrix, and hence
that the cheapest way of storing it and operating on its nonzero elements
is to use a sparse storage mode [18], where the nonzero elements of
each row of H are stored in consecutive locations in a linear array ELEM.
If N is the number of nonzero entries in H, then the length of ELEM
should be at least equal to N. In addition, two further integer arrays are
needed: the first has at least N elements to store the column number of
the corresponding entries in ELEM, and the second contains n pointers to
ELEM. These pointers specify the position in ELEM of the first element in
each of the n rows of H. Hence the minimum storage requirement for H is

   S_1 = (c_r + c_i) N + c_i n

where c_r and c_i are the costs of storing a real number and an integer,
respectively.

On the other hand, each elemental k×k matrix H^e has to be accompanied
by an integer vector to indicate the global label of each local node
(e,j), j=1,...,k. Hence, the total storage for the unassembled matrices is

   S_2 = m k^2 c_r + m k c_i

Assuming that the cost of storing a real number is double the cost of
storing an integer, we obtain the ratio

   S_1 / S_2 = (3N + n) / (2 m k^2 + m k)

In order to compare the computational cost for the direct product Hp
and ALG9, we assume that the costs of a multiplication and an addition are
w_m and w_a, respectively. For the direct product, we include the number of
operations required to assemble the matrix H (m k^2 additions), and hence
obtain

   W_1 = (w_m + w_a) N + m k^2 w_a

For ALG9, this cost is

   W_2 = (w_m + w_a) m k^2 + m k w_a

Neglecting the second terms in W_1 and W_2, we obtain the pessimistic ratio

   W_1 / W_2 = N / (m k^2)

In Figure 8.8, we assume that m_h and m_v are much larger than one,
and list the estimated formulas for m, n and N as well as the ratios S_1/S_2
and W_1/W_2 for the four types of elements of Figure 8.7, and the matrix H
corresponding to the finite element mesh of Figure 8.6.


Element type   k    m         n            N              S_1/S_2   W_1/W_2

     a         4    m_h m_v   m_h m_v      9 m_h m_v      0.777     0.562
     b         8    m_h m_v   3 m_h m_v    47 m_h m_v     1.059     0.734
     c         9    m_h m_v   4 m_h m_v    64 m_h m_v     1.146     0.790
     d         16   m_h m_v   9 m_h m_v    225 m_h m_v    1.295     0.879

Figure 8.8 - Comparison of ALG9 with direct multiplication
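
The two ratios can be recomputed from the formulas for S_1/S_2 and W_1/W_2
given above. The sketch below assumes, as in the text, that a real number
costs twice as much to store as an integer and that the second terms of W_1
and W_2 are neglected; since m, n and N are all proportional to m_h m_v for
large meshes, only the per-element factors are passed in. The function name
and the table layout are hypothetical.

```python
def ratios(k, n_per_elem, N_per_elem):
    """Return (S_1/S_2, W_1/W_2) for large uniform meshes.

    k: nodes per element
    n_per_elem: nodes per element of the mesh, n / m
    N_per_elem: nonzero entries of H per element, N / m
    """
    storage = (3 * N_per_elem + n_per_elem) / (2 * k**2 + k)
    work = N_per_elem / k**2
    return storage, work


# Per-element factors (k, n/m, N/m) for the four element types.
element_types = {"a": (4, 1, 9), "b": (8, 3, 47), "c": (9, 4, 64), "d": (16, 9, 225)}
for name, args in element_types.items():
    s_ratio, w_ratio = ratios(*args)
    print(name, round(s_ratio, 3), round(w_ratio, 3))
```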


The value of the ratio S_1/S_2 indicates that for elements of order
higher than the four-node type, the overhead associated with the sparse
storage of H makes it cheaper to store the unassembled matrices H^e,
e=1,...,m. It also turns out that any banded or profile scheme for storing
H will require more storage than the sparse scheme assumed here. On
the other hand, the cost of executing ALG9 is always higher than that for
the direct multiplication of H and p. However, the ratio W_1/W_2 indicates
that the additional work in ALG9 is relatively small, especially for higher
order element types. This suggests that we may be willing to pay this
price in return for the speed-up and the elimination of memory fetch gained
from our pipelined systolic system of the following chapter.
9.  POSSIBLE  CONFIGURATIONS  OF  A  COMPLETE  FINITE  ELEMENT  SYSTEM.

Our primary goal in this chapter will be to show that the
pipelined/systolic approach may be applied to the design of a complete
finite element system. We do not intend to specify the details of an
ultimate system nor to compare different possible designs. Instead, we will
identify the basic functional units in a complete system and then refer to
possible implementations for these units, with the emphasis on the interface
and interaction between them.

By its very nature, any systolic or pipeline network needs to be monitored
by a host computer. In our system, the host is assumed to be a
general purpose computer that contains the data base for the problem to
be solved. It also constitutes the only means of communication between the
user and the system, through which the user specifies or updates the
information about the finite element mesh and the partial differential
equation and in turn obtains the results of the analysis. The host resembles
the heart of our systolic finite element system. It is responsible for
setting, initiating and feeding the systolic pipe with the appropriate data
as well as for collecting the output data and performing some additional
tasks that we will discuss later.

A basic functional unit that should be included in any pipelined finite
element system is a unit for the generation of the elemental arrays. Its
tasks may be identified as follows: a) generate the arrays H^e and b^e for
the elements e=1,...,m, b) update H^e and b^e for some elements to force
the solution to be equal to zero at some portions of the boundary, c) add
the effect of the essential boundary conditions to H^e and b^e, and d)
associate with each entry of H^e and b^e the position at which this entry is
to be accumulated in the global arrays H and b.


Figure 9.1 - The generation of the elemental arrays

The above tasks may be executed using the systolic networks described
in Chapters 7 and 8. In fact, task (a) may be executed on the networks
N1,...,N6 of Chapter 7, tasks (b) and (d) on the network N7 of Section 8.1
and task (c) on the network described in Section 8.2 (call it N8). These
networks were designed such that when connected as a pipeline (see Figure
9.1), the output of each sub-network is the input to the next one.

Before  initiating  the  operation  of  the  system,  the  host  should  compute 

the  line  integrals  that  account  for  the  essential  boundary  conditions  and 

load  them  into  the  local  memories  for  N8.  It  should  load  also  the  quantities 
0  1  2 

v.  (g).  A.Cg)  and  A.  (g),  /=!,••  *.k.  g  =  l,**-,g  into  the  local  memory  LM. 
Then,  the  host  initiates  the  operation  and  pumps  the  input  data  for  the  dif¬ 
ferent  finite  elements  through  the  system  along  the  proper  input  links.  The 
form  of  the  inputs  may  be  described  by: 


<E?  e2  <f> 


/  =  1 .2 


„  q  +3k  +9  d3 k  ,  03  o 

Vat!  *  "  > 

°9.otl  "  e  =  1  1  W  > 


(9.1. a) 


(9.1.b) 


(9.1.C) 


(9.l.d) 


P9.q  +  1  = 

P3A+9,q  +  1 
°0.q+4'  = 
^o.a+4  = 


n,*3*t9  p3»  p3  ,  , 

e=l.m  k  2 

=  n“I*9*t9  p3*  92  J 

a  =  l  .m  1 

na+6/f+16  P3^^(  02ye} 

qt6*  +  16  „3/f  fl2  », 

n  Pe=l.m(  0  V 


<P  ) 


(9.1.6) 

(9.1.0 

(9.1.g) 


Here, the operator P^{3k}_{e=1,m} indicates that the data for the different
finite elements are pipelined with a pipe separation of 3k, and the
sequences with superscript e contain the relevant information about the
finite element e. For the precise definition of these sequences we refer to
the corresponding sections of Chapters 7 and 8. However, we may describe
informally the content of these sequences as follows:

*  /=1.2  contains  the  coordinates  of  the  nodes  in  element  e  (2k 
data  items). 

*  af.  f  =0,1.2  contains  the  coefficients  a0  r  r./=0.1.2  of  the  bilinear 
form  in  element  e  (9  data  items), 

*  y>®  contains  a  single  data  item;  the  load  fe . 

*  y0  contains  the  k  global  labels  of  the  nodes  in  e  (k  data  items), 
and  finally 

*  -y®  contains  k  data  items  that  specify  the  nodes  at  which  the  solu¬ 
tion  must  be  zero. 

Hence,  if  there  are  input  buffers,  the  host  should  supply  the  systolic 
pipe  with  4k+10  data  items  every  3k  time  units.  This  is  a  relatively  low  rate 
which  can  be  achieved  even  by  a  microcomputer.  With  the  help  of  the 
abstract  systolic  model,  we  were  able  to  prove  that  if  the  input  to  the  sys¬ 
tolic  pipe  is  as  given  by  (9.1).  then  the  system  will  produce  the  elemental 
arrays  on  the  output  links  zy  and  b u  g+g.  u=0.  •  •  •  .k  of  N8  according 


to  the  formulas 


.  _  q+6u+3/f  +  18  p3k  2 

Vq+6  '  n  e=l.m  (0 


V 


(9. 2. a) 


a  -  rtq+6u+3/<+18  n3/c  .  .,1.1.1  „*2*  e  2-e  * 

*■  —  -  n  Pe=l.m<  W1  (e  ^  •  ne2xj  .  0  )  ) 


u  .<7+6 


(9.2. b) 


where  X  =0  .  f(5®)=  T(X®)=  k,  7-(JZ®)=  T(X®)=  T(X®)=  k-u  for  e-0.- 


and 


M  (f)  = 
u 


r  Hu+u 


t 


u=  0.  •  •  •  ,k-l 


u  =k 


X®  =  globle.t ) 

X^  =  glob  le  .t+u)  -  globle.t ) 


o=0,  *••,/< 
o=0.  •  •  *  .*-1 


In  other  words.  5®  contains  the  elements  of  H®  and  5®.  and  X®.  X®  con- 
o  o  o 

tain  the  addresses  where  these  elements  are  to  be  accumulated  in  the  glo¬ 
bal  arrays  H  and  b. 


As noted earlier, in order to integrate the system of Figure 9.1 into a complete finite element system, we must distinguish between two types of systems according to the method used for solving the resulting linear system of equations

$$H u = b \qquad (9.3)$$

These are either systems that employ direct solvers for the solution of (9.3), or systems that use an iterative scheme for completing the analysis. We will consider the systems with the different types of solvers separately.


9.1.  Systems  that  employ  direct  solvers. 


In Figure 9.2, we show a block diagram of a complete system that uses an LU decomposition for solving (9.3). It consists of the host and four functional units. The unit labeled GEN is the generator of the elemental arrays as described earlier in some detail. The output of GEN (described by equations (9.2)) is then directed into the unit labeled ASSEMB. Its function is to assemble the global arrays H and b. The third unit, FACT, receives H and b from ASSEMB and simultaneously performs the LU factorization and produces the solution y of Ly=b.


Figure  9.2  -  A  block  diagram  of  a  system  with  a  direct  solver 

Because of the high rate at which ASSEMB receives its inputs and hence produces the elements of H and b, FACT should have enough computing power to process the data at such a rate. This power may be obtained from a very high speed array processor that may become available in the future as a result of advances in VLSI and optical communication technologies. However, with the current technology, the most suitable candidates for the implementation of FACT are systolic arrays.

Assume that ASSEMB is implemented as B+1 processor/memory units as described in Section 8.4. Hence, each unit P/M_w, $0 \le w \le B$, will produce at its output port O_port_w the elements of the w-th off-diagonal of the assembled matrix H, in order. We may also use an additional processor/memory unit P/M_{B+1} to assemble the global load vector and produce its elements at the corresponding port O_port_{B+1}.
It had already been noticed in Section 8.3 that there is no uniformity in the rate at which ASSEMB produces the assembled rows of H. For any time t, this rate was denoted by $r_p(t)$. In order to obtain an upper bound on $r_p(t)$, we assume that the nodes of the finite element mesh are numbered by means of algorithm ALG8. Note that the km input rows of ASSEMB are generated by GEN at a uniform rate of one row every three time units, that is, according to the terminology of Section 8.3, we have $r_i = 1/3$. Then, by Proposition 8.3, $r_p(t)$ cannot, at any time t, exceed $r_i$. In other words, we have $r_p(t) \le r_i = 1/3$, which means that the average rate $\bar{r}_p$ cannot exceed 1/3. A more precise estimate of $\bar{r}_p$ is obtained by noting that ASSEMB receives km input rows at a uniform rate $r_i = 1/3$ and produces n output rows at an average rate $\bar{r}_p$. Hence, $\bar{r}_p = n/(3km) \le 1/3$.

With this implementation of ASSEMB, the systolic network of Section 4.3 may be used to implement the functional unit FACT. From now on, we will refer to this network as Sys_FACT. If synchronized by a global clock, Sys_FACT expects to receive its input (the n rows of H) at the rate of one row every three time units, that is, the rate $r_c = 1/3$ row/time unit. As mentioned earlier, with $r_c \ge \bar{r}_p$, ASSEMB will not be able to supply Sys_FACT with its inputs on time, and hence we are forced to implement Sys_FACT as a self-timed systolic network rather than as a network synchronized by a global clock. More precisely, if an input cell $C_w$ of Sys_FACT is connected to the processor/memory unit P/M_w in ASSEMB and is expected to receive from it the elements of the w-th off-diagonal of H, then, whenever $C_w$ is ready to accept the next input item, it sets the O_interrupt in P/M_w and holds its operation until P/M_w puts the required item on the output port O_port_w. For further details on the principle of operation of self-timed systolic networks, we refer to the discussion in Section 5.5.
In order to estimate the storage requirement in ASSEMB, we note again that $r_c = 1/3$ and use the result that $r_p(t) \le 1/3$ to conclude that the rows of H will be consumed by Sys_FACT as soon as they are produced by ASSEMB (become B_ready). Hence, the only storage needed in ASSEMB is for the B+1 active, but not B_ready, rows of H that are being assembled at any one time.

The self-timing synchronization of Sys_FACT makes it impossible to predict the time at which every element of the upper triangular matrix U and the partial solution vector y (Ly=b) will be produced on the output links of Sys_FACT. However, we know that Sys_FACT does generate the elements in the last row of U (and y) one time unit after it receives the corresponding elements in the last row of H (and b). Also, the last B rows of H, namely $h_{n-B}, \dots, h_n$, are made available to Sys_FACT just after ASSEMB receives its last input at time 3km+9k+q+16. Since Sys_FACT is able to consume these B rows in 3B time units, we may conclude that the factorization and partial solution will be completed at time 3km+9k+3B+q+17 ≈ 3(km+B).

Finally, we discuss the last functional unit in Figure 9.2. This unit, BACK, performs the last step in the analysis, namely the solution of the triangular system Uu=y by back substitution. Although its task is simple, BACK cannot start its computation before the last row of U and the last element of y are available. Hence a temporary storage (TEMP in Figure 9.2) must be provided for storing these elements upon their generation from Sys_FACT until all the elements of U and y are generated.

The systolic network for back substitution described in [31] may be used for BACK. However, we may also use Sys_FACT to perform the back substitution as described in Section 4.3. Although this may be very inefficient, it eliminates the need for any separate hardware for BACK, provided that the entire system will not be used to pipeline the computations for more than one finite element problem. In any case, the back substitution will not require more than 3n time units, and hence the entire analysis will be completed in approximately 3(n+km+B) time units, which is a considerable speed-up over the time for serially executing the $O(n^2)$ operations estimated in [9]. No comparison can be made with the parallel finite element machine of ICASE [29] because the latter cannot use direct solution schemes.
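
The timing estimates quoted above can be collected into a small calculator. The sketch below only restates the closed-form expressions of the text; the function names and the sample values of k, m, n, B and q are illustrative assumptions, not data from the original study.

```python
def factorization_done(k, m, B, q):
    """Time at which Sys_FACT completes the factorization and partial solution.

    ASSEMB receives its last input at 3km + 9k + q + 16; Sys_FACT then
    consumes the last B rows in 3B time units and emits the last row of U
    (and y) one time unit later.
    """
    last_assemb_input = 3 * k * m + 9 * k + q + 16
    return last_assemb_input + 3 * B + 1


def total_time_estimate(n, k, m, B):
    """Leading-order estimate 3(n + km + B) for the complete analysis."""
    return 3 * (n + k * m + B)


if __name__ == "__main__":
    # Hypothetical problem: 100 triangular elements (k = 3), 66 nodes,
    # bandwidth B = 8 and network parameter q = 5.
    print(factorization_done(k=3, m=100, B=8, q=5))
    print(total_time_estimate(n=66, k=3, m=100, B=8))
```

The first function keeps the lower-order terms 9k+q+17, while the second drops them, which is why the two are only comparable for large km.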

The system described above profits from all apparent concurrencies in the finite element analysis. However, it has a serious disadvantage, namely the architectures of its units ASSEMB and FACT depend on the bandwidth B of the matrix H, which varies from problem to problem. This disadvantage is shared with most of the known systolic networks that operate on banded matrices [31,7,34]. In order to be able to use a system designed for a certain bandwidth B on a problem with a larger bandwidth B'>B, we should be able to partition the computation appropriately to allow its execution on the existing hardware. ASSEMB can easily accommodate a partitioning in which each P/M_w performs the computation associated with more than one off-diagonal of H. However, the complex communication pattern of Sys_FACT makes its partitioning non-trivial. More research is needed on Sys_FACT or alternate architectures for the direct solution of (9.3) if we desire to have a system that is independent of the bandwidth B. Pipelined finite element systems that employ iterative solution schemes do not appear to share this particular disadvantage.

Remark:

The validity of $\bar{r}_p \le r_c$ was based on the assumption that the degree of freedom d per node is one (see Remark 1 in Chapter 6). This relation may change if d is larger than unity. In order to be more specific, we assume that Remark 3 in Chapter 7 is applied so that GEN generates the $d^2$ elements of each entry in $H^e$ simultaneously. We also assume that ASSEMB is capable of assembling these elements as soon as they are received, and hence that the execution time of the assembly stage remains equal to 3km time units. In this case, the nd rows of the global matrix H are produced by the assembler at the rate $\bar{r}_p = nd/(3km)$ rows/time unit. On the other hand, the rate at which Sys_FACT can consume the B_ready rows remains $r_c = 1/3$, and hence the ratio $\bar{r}_p / r_c = nd/(km)$ is larger than one except for small values of d (see e.g. Figure 8.8). This causes the storage requirement for the assembly process to increase without restrictions.

In order to limit the size of the storage requirement, we have to slow down the rate at which the elemental arrays are generated. This is possible by using one of the following two techniques:

1) A modification of the control parameters in GEN as described in Remark 2 of Chapter 7. Now, each elemental array is generated in ck time units rather than 3k time units, where c may be chosen such that $\bar{r}_p = nd/(ckm)$ does not exceed $r_c = 1/3$, that is, $c \ge 3nd/(km)$.

2) The use of a self-timed technique for the synchronization of the systolic system GEN. With this, a fixed storage may be used in ASSEMB and whenever this storage is fully utilized, ASSEMB stops consuming the output of GEN, thus forcing it to a temporary halt until Sys_FACT consumes some of the rows in ASSEMB. This alternative is preferred as it adjusts the rate $\bar{r}_p$ automatically and efficiently. In this case, Sys_FACT becomes the bottleneck of the system and hence the time for the completion of the entire computation becomes approximately 6nd time units, where 3nd units are consumed by GEN and FACT, and the other 3nd units by the back substitution step.
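
The rate argument of this remark can be checked numerically. The following sketch assumes, as in the text, that nd rows are assembled while GEN spends ck time units on each of the m elements and that Sys_FACT consumes rows at the rate 1/3; the helper names and the sample mesh are hypothetical.

```python
import math
from fractions import Fraction

R_C = Fraction(1, 3)  # consumption rate of Sys_FACT, in rows per time unit


def assembly_rate(n, k, m, d=1, c=3):
    """Average production rate of ASSEMB: n*d rows in c*k*m time units."""
    return Fraction(n * d, c * k * m)


def smallest_separation_factor(n, k, m, d):
    """Smallest integer c >= 3 for which the production rate stays <= R_C."""
    return max(3, math.ceil(Fraction(n * d, k * m) / R_C))


if __name__ == "__main__":
    # Hypothetical mesh: 8 triangular elements (k = 3) sharing 9 nodes.
    for d in (1, 2, 4):
        c = smallest_separation_factor(9, 3, 8, d)
        print(d, assembly_rate(9, 3, 8, d), c, assembly_rate(9, 3, 8, d, c))
```

With c chosen this way the assembly stage lasts ckm time units, which is close to the 3nd time units mentioned above whenever 3nd/(km) exceeds 3.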

9.2. Systems that employ iterative solvers.

Direct solution schemes for the linear system (9.3) do not take advantage of the fact that the stiffness matrix H is highly sparse. This nice property is lost in part during the factorization process, thus missing a potential for savings in both the storage and the execution time. For this reason, it is sometimes beneficial to use iterative schemes for the solution of (9.3) despite their obvious disadvantages, namely, the absence of a good criterion for choosing the initial point and the possible divergence or slow convergence of the iteration.

Many iterative schemes exist for the solution of (9.3). We consider here only two schemes that are widely used in conjunction with the finite element method, namely the conjugate gradient method and the multi-grid technique.

9.2.1.  The  Conjugate  Gradient  method. 

This method was originally proposed by Hestenes and Stiefel [21]. It finds the solution of the linear system of equations Hu=b by determining the minimum $u^*$ of its gradient functional $g(u) = \frac{1}{2} u^T H u - b^T u$. The method starts with an initial guess $u^0$ and obtains a sequence of approximations $u^1, u^2, \dots$ to $u^*$ iteratively. At the i-th iteration step, a new approximation $u^{i+1}$ is obtained from the previous one $u^i$ by the addition of a step size $s^i$ along a suitable direction $p^i$ that reduces g(u). The method may be more specifically described by the following algorithm ALG10, where <x,y> denotes the inner product $x^T y$ and $\|x\|$ is a suitable vector norm, as for example, the infinity norm defined by $\|x\| = \max_i |x_i|$ for any n-dimensional vector x. The iteration is forced to halt if it does not converge within $i_{max}$ iterations.

ALG10

INPUT: $u^0$, H, b and an acceptable tolerance $\epsilon$.

1) $r^0 = b - H u^0$

Note that step 2.7 in ALG10 may be replaced by other stopping criteria and that some tests may be added for the detection of any divergence in the iteration. Note also that most of the work in ALG10 is in steps 1 and 2.3 where a matrix-vector multiplication is performed. The computations in the other steps are simple vector or scalar operations.
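
For orientation, the following is a minimal sketch of the standard Hestenes-Stiefel iteration in plain Python. It is not a transcription of ALG10: the step numbering, the stopping test on the infinity norm of the residual, and the names `eps` and `i_max` are assumptions made for this illustration. The single matrix-vector product per iteration is the operation that dominates the cost.

```python
def matvec(H, x):
    """Dense matrix-vector product H x."""
    return [sum(h * xj for h, xj in zip(row, x)) for row in H]


def dot(x, y):
    """Inner product <x, y> = x^T y."""
    return sum(a * b for a, b in zip(x, y))


def conjugate_gradient(H, b, u0, eps=1e-10, i_max=100):
    """Solve H u = b for a symmetric positive definite H."""
    u = list(u0)
    r = [bi - hi for bi, hi in zip(b, matvec(H, u))]  # r^0 = b - H u^0
    p = list(r)                                       # first search direction
    rr = dot(r, r)
    for _ in range(i_max):
        if max(abs(ri) for ri in r) <= eps:           # infinity-norm test
            break
        Hp = matvec(H, p)                             # the costly step
        s = rr / dot(p, Hp)                           # step size
        u = [ui + s * pi for ui, pi in zip(u, p)]
        r = [ri - s * hpi for ri, hpi in zip(r, Hp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return u


if __name__ == "__main__":
    print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0]))
```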

The convergence properties of the method may be improved by a technique called 'preconditioning', where the solution of Hu=b is obtained by solving another linear system $\tilde{H}\tilde{u} = \tilde{b}$ for which the iteration converges faster than for the original one. The transformation between H, u and b and $\tilde{H}$, $\tilde{u}$ and $\tilde{b}$ usually is simple and relatively straightforward. For a detailed description of the preconditioning techniques we refer to [12].

The second method that may be used for solving (9.3) is the multi-grid method.

9.2.2. The Multi-Grid method.

The basic philosophy of this method [38,6] is that, in an iterative scheme, the amount of computation at each step should be proportional to the gain obtained from it. In order to be more specific, we denote the finite element grid (mesh) by $G_0$ and the corresponding stiffness matrix and load vector by $H_0$ and $b_0$, respectively. Also, let $u_0^1, u_0^2, \dots$ be the sequence of approximate solutions of $H_0 u = b_0$ generated by a given iterative scheme.

At the first few steps of the iteration, the residual decreases rapidly from one iterate to the next, but soon after, the convergence rate levels off and becomes very slow. Closer examination [6] of the Fourier expansion of the residual (the error) shows that the convergence is fast as long as the residuals have strong fluctuations on the scale of the grid $G_0$, and that this rate slows down when the residuals are smoothed out. At this point, it is more beneficial to reduce $G_0$ into a coarser grid $G_1$ and continue the computation on $G_1$. This has two advantages, namely 1) the relative fluctuation of the error will increase when measured with respect to the coarse grid $G_1$, thus speeding up the process of eliminating the error components that were decreasing slowly on $G_0$, and 2) the cost of the iteration steps will decrease due to the reduction of the number of elements and nodes and hence of the size of the system. This idea may be expanded by using a sequence of grids $G_0, G_1, \dots$ where each grid $G_i$ is coarser than $G_{i-1}$. Note that, in addition to the application of a specific iterative scheme, the multigrid solution process involves some data transformation between the different grids.

For a more specific outline of the process suppose that a sequence of fine to coarse grids $G_0, G_1, \dots$ has been given. The number of nodes in each grid $G_i$ is denoted by $n_i$ and hence, any vector defined on $G_i$ is a member of the vector space $R^{n_i}$. The desired solution u of $H_0 u = b_0$ corresponding to the finest grid is obtained from an initial guess $u^0$ by the recursive application of the following algorithm ALG11, starting with i=0 and $u_0^0 = u^0$.

ALG11

INPUT: A grid $G_i$, the corresponding matrix $H_i$ and right side $b_i$, and an initial point $u_i^0 \in R^{n_i}$.

1) Use an iterative scheme (e.g. the conjugate gradient scheme) to compute a sequence of approximate solutions $u_i^1, u_i^2, \dots$. Stop the iteration when the rate of convergence becomes smaller than a certain acceptable value. Let $u_i$ be the last obtained approximation.

2) Compute the residual $r_i = b_i - H_i u_i$.

3) Consider the next grid $G_{i+1}$ and obtain the corresponding residual $r_{i+1} \in R^{n_{i+1}}$ on $G_{i+1}$ by appropriately averaging the components of $r_i$. Obtain also the corresponding stiffness matrix $H_{i+1}$.

4) Find the solution $\Delta_{i+1}$ of the lower dimensional system $H_{i+1} \Delta_{i+1} = r_{i+1}$: IF i+1 < k THEN invoke ALG11 recursively, ELSE solve $H_k \Delta_k = r_k$ exactly by means of a direct solver.

5) Interpolate $\Delta_{i+1}$ back from $G_{i+1}$ to $G_i$. Denote the interpolated vector by $\hat{\Delta}_i$ ($\hat{\Delta}_i \in R^{n_i}$).

6) Set the solution $u = u_i + \hat{\Delta}_i$.

The averaging operation in step 3 is taken to be the dual of the interpolation operation of step 5. That is, if we denote by $I_i^{i+1}$ the linear operator used to obtain $r_{i+1}$ from $r_i$, and by $I_{i+1}^i$ the linear operator used to interpolate $\Delta_{i+1}$ to $\hat{\Delta}_i$, then the two operators are related by $I_i^{i+1} = c\,(I_{i+1}^i)^T$, with some constant c.

Finally, we note that the matrix $H_{i+1}$ corresponding to the grid $G_{i+1}$ may be obtained either directly by using ALG5 of Chapter 7, or from the relation $H_{i+1} = I_i^{i+1} H_i I_{i+1}^i$. It can be seen that the two approaches are equivalent.
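
For a one-dimensional model problem the equivalence of the two approaches can be checked directly. The sketch below assumes linear elements on a uniform mesh of the unit interval, linear interpolation as the coarse-to-fine operator, and c = 1; none of these choices is taken from the text, and all function names are illustrative.

```python
def stiffness(n, h):
    """Stiffness matrix for -u'' with linear elements, n interior nodes, mesh size h."""
    return [[(2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0) / h
             for j in range(n)] for i in range(n)]


def interpolation(n_coarse):
    """Linear interpolation from n_coarse interior nodes to 2*n_coarse + 1 nodes."""
    n_fine = 2 * n_coarse + 1
    P = [[0.0] * n_coarse for _ in range(n_fine)]
    for j in range(n_coarse):
        P[2 * j][j] = 0.5      # fine node to the left of coarse node j
        P[2 * j + 1][j] = 1.0  # fine node coinciding with coarse node j
        P[2 * j + 2][j] = 0.5  # fine node to the right of coarse node j
    return P


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def galerkin_coarse(H_fine, P, c=1.0):
    """Coarse matrix R H P with the averaging operator R = c * P^T."""
    R = [[c * x for x in row] for row in transpose(P)]
    return matmul(matmul(R, H_fine), P)


if __name__ == "__main__":
    H_fine = stiffness(7, 1.0 / 8.0)              # fine grid, h = 1/8
    H_coarse = galerkin_coarse(H_fine, interpolation(3))
    print(H_coarse == stiffness(3, 1.0 / 4.0))    # compare with direct generation
```

In this model case the product reproduces exactly the matrix generated directly on the coarse mesh, because the coarse basis functions are linear combinations of the fine ones.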

After  having  introduced  some  possible  iterative  schemes,  we  describe 
next  their  application  in  the  context  of  a  systolic/pipelined  finite  element 
system. 


9.2.3.  An  iterative  systolic  finite  element  system. 

In what follows, we will assume that the iterative scheme used to solve (9.3) involves the matrix H only in the computation of its product with a certain vector. It was shown in Section 8.5 that this product may be formed using the unassembled elemental arrays, thereby eliminating the need for the irregular assembly stage. Moreover, the system described in Chapter 7 pipelines the computation of the elemental arrays $H^1, \dots, H^m$. Hence, for a given vector p, the calculation of the partial product vectors $y^1 = H^1 p^1, \dots, y^m = H^m p^m$ may also be pipelined, where $p^e$ and $y^e$, e=1,...,m, are as described in ALG9 of Section 8.5.

The multiplication $y^e = H^e p^e$, e=1,...,m, may be performed on the systolic network described in Section 3.1. This network is shown in Figure 9.3 after relabeling its nodes in a manner consistent with the networks that generate the elemental arrays. The additional row of cells shown in the figure is used to prepare the output of GEN in the form required for the proper operation of the multiplication network.

Figure 9.3 - A matrix/vector multiplication network

More descriptively, the input links $z_{u,q+6}$, u=0,...,k-1, are connected to the corresponding output links in the subnetwork N8 of Figure 9.1. Hence, the data sequences on these links are described by (9.2.a). However, by comparing (9.2.a) with the formula (3.11.a), given in Section 3.1 for the input of the multiplication network, it is clear that the elements of the u-th off-diagonal of the multiplied matrix should be followed by u zeroes when transmitted on the input links of the network. These zeroes are not present in the output of GEN as described in (9.2.a).

In order to insert these zeroes in the data streams, we use the multiplexer cells (u,q+6), u=0,...,k-1, to multiplex appropriately the data on the links $z_{u,q+6}$ with the zero sequence. The operation of these cells is formally described by

,  _  n  M3(k-u).3u 

‘•u.q+7  "  u  '"q+6u+3*  +  19%.qt6  ’  u 

With the help of Properties P5 and P15 in Appendix A, it can be shown easily that the data on the links $z_{u,q+7}$ are in the form required for the proper operation of the matrix/vector multiplication network, namely


=  6ut3ktl9  p3k  (02  *e} 


^  it  n  +7 


u  =0.  •  *  •  ,k  -1  (9. 4. a) 
where $T(\hat{\mu}^e_u) = k$ and

$$\hat{\mu}^e_u(t) = \begin{cases} H^e_{t,t+u} & \text{if } t \le k-u \\ 0 & \text{if } t > k-u \end{cases}$$
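
The content of these zero-padded sequences can be illustrated with a few lines of Python. The sketch below ignores all timing (initial delays and pipe separation) and only shows which k items travel on link u for one element; the function name is hypothetical, and indices are 0-based here whereas the text counts t from 1.

```python
def padded_off_diagonal(H, u):
    """Items of the u-th off-diagonal of the k x k matrix H, followed by u zeroes."""
    k = len(H)
    return [H[t][t + u] for t in range(k - u)] + [0] * u


if __name__ == "__main__":
    H = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
    for u in range(3):
        print(u, padded_off_diagonal(H, u))
```

Every sequence has length k, so the streams for successive elements stay aligned across the links.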


In order to obtain the desired vectors $y^e$, e=1,...,m, the components of the vectors $p^e$, e=1,...,m, must be provided on the input link $\rho_{0,q+7}$ according to the formula


0  _  n<7+3/t  +  19  3 k  2  _fi 

P0.g+7  -  n  P«=1.m(e  " 


(9.4.b) 


where $T(\pi^e) = k$ and $\pi^e(t) = p^e_t$ is the t-th component of the vector $p^e$. From ALG9, we have $p^e_t = p(glob(e,t))$. Applying the remark of Section 3.1 with c=3 and using the technique of Section 5.4 for verifying the pipelined operation, we may prove that the output on the link $\rho_{k+2,q+7}$ is

=  n<*+9 *+21  p3*  /Q2  „•>  (9  5) 

pk+2.qt7  n  ^a=l,m(e  ^  }  (9  5) 

where $T(\eta^e) = k$ and $\eta^e(t) = y^e_t$. Thus $\eta^e$ contains the components of the vector $y^e = H^e p^e$.

If the host computer collects the outputs from $\rho_{k+2,q+7}$ and accumulates each element $y^e_t$ in its correct position y(glob(e,t)), in one time unit, then the product vector y=Hp will be available only eight time units after the generation of the elemental arrays is completed.
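
The gather, multiply and accumulate steps described above amount to forming y=Hp without ever assembling H. A minimal sequential sketch follows, assuming that `glob[e][t]` plays the role of glob(e,t) with 0-based labels; the function name and the two-element example are hypothetical.

```python
def elementwise_matvec(elements, glob, p, n):
    """Form y = H p from unassembled elemental matrices.

    elements[e] is the k x k matrix H^e and glob[e][t] is the global label
    of the t-th node of element e.
    """
    y = [0.0] * n
    for He, labels in zip(elements, glob):
        pe = [p[g] for g in labels]                    # p^e_t = p(glob(e, t))
        ye = [sum(h * x for h, x in zip(row, pe)) for row in He]
        for t, g in enumerate(labels):
            y[g] += ye[t]                              # accumulate at y(glob(e, t))
    return y


if __name__ == "__main__":
    # Two one-dimensional two-node elements sharing the middle node.
    He = [[1.0, -1.0], [-1.0, 1.0]]
    print(elementwise_matvec([He, He], [[0, 1], [1, 2]], [1.0, 2.0, 4.0], 3))
```

In the systolic system the inner product rows are computed by MULT and only the final accumulation is left to the host.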

In an iterative scheme, the formation of the product Hp, for a certain vector p, constitutes the major part of each step. The remaining part of each iteration step is less time consuming but depends on the scheme being used. It also requires some intelligent decisions concerning the convergence rate and the stopping criteria. Although it is possible to design special hardware for the completion of some iterations, we believe that it would be more appropriate to assign this task to the host computer.


Figure  9.4  -  A  system  that  employs  iterative  solvers 

In Figure 9.4, we show a block diagram of a systolic system that employs an iterative scheme for the solution of (9.3). It is composed of a host computer and two systolic functional units, namely GEN for the generation of the elemental arrays and MULT for the matrix/vector multiplication. In this system, the host is more involved in the computation than in the system that employs direct solvers. In fact, the host is a general purpose computer that executes a sequential finite element program and uses the systolic units GEN and MULT as high speed devices to perform some compute-bound operations in the program.

The elemental matrices $H^e$, e=1,...,m, are used throughout the iterative solution process. Hence, they may be stored in the auxiliary storage STORE (see Figure 9.4), and repeatedly used in successive steps. Note that the form of the input to MULT described by (9.4) was implied by the assumption that MULT receives its input directly from GEN. However, if the elements of $H^e$, e=1,...,m, can be fetched from STORE at a rate higher than the one specified by (9.4), then MULT may generate the vectors $y^e$, e=1,...,m, at the same rate. More specifically, if we replace the input sequences (9.4) by




/  =  na+2</  P*  la9) 

.q+7  n  e=1./n  (V 

p0,q+7  *  n*  /e»l.m<F*) 


with some initial delay a, then the output will be described by


p/t  +2.q+7 


’lt2tt2 


rather than by (9.5). That is, we may increase the speed of MULT by a factor of three.

But for use with practical problems, STORE should be a high capacity storage device. By current technology standards, this means that its speed will be relatively low. Hence we may not be able to supply MULT with the needed inputs at a rate faster than the one specified by equation (9.4). In that case, it is more appropriate to eliminate STORE from the system and to use GEN for the regeneration of the elemental arrays at each iterative step. This may seem to be an unnecessary computation. However, by applying this idea, we increase the speed of the system by using a resource that would otherwise be idle.

The idea of regenerating the elemental arrays is even more attractive if a multigrid technique is used for solving (9.3). More specifically, it is clear from ALG11 that the multigrid technique often switches from one grid to another, and in each such switch the global stiffness matrix corresponding to the new grid has to be generated. In that case, the regeneration of the elemental arrays in our system becomes an essential operation rather than a redundant one. Note that the architectures of GEN and MULT depend neither on the specific mesh that covers the problem-domain nor on the bandwidth of the resulting matrix H. Hence, the matrices corresponding to the different meshes may be generated from GEN without any reconfiguration or change in the control parameters.
Finally, we note that the speed-up in the finite element computation achieved by the system sketched in this section is due to many factors, namely 1) the pipelining of the generation of the elemental arrays (this factor is more prominent if a multigrid technique is used to solve the linear system (9.3)), 2) the elimination of the time consuming fetch operations from the slow speed auxiliary storage, and 3) the reduction of the time for each iteration step by a factor of k (kd in general, as we will see in the next section), which is the speed-up provided by the systolic network MULT. A general mathematical formula for the overall speed-up provided by the system seems to be impossible to obtain. This is mainly due to the absence of any reasonable criteria for the estimation of the implicit speed-up obtained by the smooth flow of data in the system and the elimination of the store/fetch operations of both the data and the instructions. In fact, more research needs to be done in order to obtain a good measure for the evaluation of systolic systems.

Next, we consider the decomposition of the multiplication operation $y^e = H^e p^e$ in the case of problems with more than one degree of freedom.

9.2.4.  The  decomposition  of  the  matrix/vector  multiplication  for  d  >  1. 

In problems with degree of freedom d>1, each (i,j)-th element $H^e_{i,j}$, $1 \le i,j \le k$, of the matrix $H^e$ is a d×d sub-matrix. Here, we denote the (g,l)-th element of this sub-matrix by $H^e_{i,j,g,l}$. Similarly, we denote the g-th element of the d-dimensional sub-vectors $y^e_i$ and $p^e_i$ by $y^e_{i;g}$ and $p^e_{i;g}$, respectively.

With this notation, the dk components of the vector $y^e = H^e p^e$ are, for i=1,...,k and g=1,...,d, given by

$$y^e_{i;g} = \sum_{l=1}^{d} \Lambda^e_{i,g,l}, \qquad \Lambda^e_{i,g,l} = \sum_{j=1}^{k} H^e_{i,j,g,l}\, p^e_{j;l} \qquad (9.6)$$
If the idea of Remark 3 in Chapter 7 is used, then the systolic system GEN may be applied to generate the $d^2$ elements $H^e_{i,j,g,l}$ of each $H^e_{i,j}$ simultaneously, each on a separate output link. Hence, each link $z_{u,q+7}$, $0 \le u \le k-1$, may be replaced by the $d^2$ links $z_{u;g,l}$, g,l=1,...,d, where the output data on these links are described by:


£  £ 
£  '  • 


r  .  q+6ut3k-H9  p3*  2  ;e 

S;g./  ‘  n  e=l,m  (  e  Mu;g.i 


with $T(\hat{\mu}^e_{u;g,l}) = k$ and

$$\hat{\mu}^e_{u;g,l}(t) = \begin{cases} H^e_{t,t+u,g,l} & \text{if } t \le k-u \\ 0 & \text{if } t > k-u \end{cases} \qquad (9.7.a)$$

This is a generalization of the formulas (9.4.a) to the case d>1.

Then, the $d^2$ partial sums $\Lambda^e_{i,g,l}$, g,l=1,...,d, may be generated simultaneously by using $d^2$ identical copies $MULT_{g,l}$, g,l=1,...,d, of the multiplication network MULT. The inputs to each network $MULT_{g,l}$ are the generalized forms of (9.4.a/b), namely (9.7.a) and


,9+3*  tl  9  03k  ,a2  9  . 

“ov.!  *  n  p9.l.m(e  *»./> 


(9.7. b) 


where, for any g, $1 \le g \le d$, $T(\pi^e_g) = k$ and $\pi^e_g(t) = p^e_{t;g}$. With this, it is straightforward to show that the output of $MULT_{g,l}$ is

'’“..m*®2  C/>  ,98> 


where $T(\eta^e_{g,l}) = k$ and $\eta^e_{g,l}(t) = \Lambda^e_{t,g,l}$. This is the corresponding general form of (9.5).


In order to obtain the components $y^e_{i;g}$ of $y^e$, the d output links $\rho_{k+2;g,l}$, l=1,...,d, for a fixed g, $1 \le g \le d$, are connected to an adder as shown in Figure 9.5(a). Hence, the output of this adder is described by


Figure 9.5 - (b) interleaved multiplication


_  _  n<7+9k+22  p3k  2  < 

pk+3;g  -  n  Pe=l.m(e  *£ 


(9.9) 


where $T(\eta^e_g) = k$ and $\eta^e_g(t) = \sum_{l=1}^{d} \Lambda^e_{t,g,l} = y^e_{t;g}$.
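
The decomposition into partial sums can be checked against an ordinary block matrix-vector product. In the sketch below, `H[i][j][g][l]` stands for the (g,l)-th entry of the (i,j)-th d×d sub-matrix and `p[j][l]` for the l-th entry of the j-th sub-vector; the function names are hypothetical and the timing of the networks is ignored.

```python
def partial_sums(H, p):
    """Lam[i][g][l] = sum over j of H[i][j][g][l] * p[j][l].

    Each fixed pair (g, l) corresponds to the work of one copy of MULT.
    """
    k, d = len(p), len(p[0])
    return [[[sum(H[i][j][g][l] * p[j][l] for j in range(k))
              for l in range(d)] for g in range(d)] for i in range(k)]


def add_partial_sums(Lam):
    """y[i][g] = sum over l of Lam[i][g][l], the task of the adder."""
    return [[sum(per_l) for per_l in per_g] for per_g in Lam]


def block_matvec(H, p):
    """Reference result: H treated as one (kd) x (kd) matrix."""
    k, d = len(p), len(p[0])
    return [[sum(H[i][j][g][l] * p[j][l] for j in range(k) for l in range(d))
             for g in range(d)] for i in range(k)]


if __name__ == "__main__":
    k = d = 2
    H = [[[[i * 8 + j * 4 + g * 2 + l + 1 for l in range(d)] for g in range(d)]
          for j in range(k)] for i in range(k)]
    p = [[j * 2 + l + 1 for l in range(d)] for j in range(k)]
    print(add_partial_sums(partial_sums(H, p)) == block_matvec(H, p))
```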


We should note here that each of the $d^2$ networks $MULT_{g,l}$ operates once every three time units. However, the efficiency of the system may be improved by interleaving the computation on a reduced subset of these networks. In order to be more specific, we assume that d>3 and apply Remark 2 in Chapter 7 to increase the pipe separation during the operation of GEN from 3k to dk. This allows us to use only one link $\bar{z}_{u;g}$ to multiplex the data on the d output links $z_{u;g,l}$, l=1,...,d. In terms of data sequences, this means that the equations (9.7.a) are merged into


j  _  g+2du+d*+19  dk  _d- 1  -e 

n  -  n  P--l  -  W-i  (9  U 


0  =  1. m  1 


Vg.  1  ‘ 


Pu  d  )).  g=1.*  •  *.d  and  u=0.'  •  *./t-l 


(9. 10. a) 


We  may  then  use  only  d  matrix/vector  multiplication  networks 
MULT^ ,  •  •  •  .MULT^ .  and  apply  to  each  network  MULT  l  <g<d.  the  inputs 
described  by  (9.10.a)  and 

-  _  nq+dk  + 19  pdk  _.d-l  e  _d-l  d-1  e 

p0;a  -  n  <9  ”g.V'n  9  Vtf” 


-  n9t*+,»  /=f_,  <»•) 

e  =  i.m  g 


(9.  lO.b) 


Here,  for  any  g.  we  have  T(ii*)=kd  and  n°(i f)=  pf.  that  is.  the  rr/l  com- 

0  g  f 

ponent  of  the  kd  -dimensional  vector  p0. 

With  this  input  to  the  network  I/O  description.  It  is  easy  to  show  that 
the  output  of  MULT^  is 

-  _  rig+3dk+21  pdk  .^l.-.l.  _d-l  9  _d-l  Qd-1  e  ,, 

pk+2:g  -  n  ^l/l  (®  Vl'"-'n  ® 

The  accumulator  shown  in  Figure  9.5(a)  is  used  to  accumulate  every  d 
successive  items  on  The  operation  of  this  accumulator  is  formally 

described  by 

-  .q  t22,d  .1  - 

pk+3:g  "  n  A  p*+2:g 

which  gives 

—  _g+3dfc+21td  Ddk  ,-d- 1  e. 

pk  +3;g  *  n  %=l,m (e  ’o’ 

Thus, we obtain the same results as with the sequence of equation (9.9),
but at a rate of one result every $d$ time units rather than every three
time units. In other words, in order to increase the efficiency, we have to
reduce the rate at which the system operates.
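
The accumulation of every $d$ successive items can be illustrated with a small C sketch. This is only an illustration under stated assumptions, not the formal accumulator operator of the model: the sequence is held in a plain array, the hypothetical function `accumulate_blocks` sums consecutive blocks of `d` items, and any trailing items that do not fill a block are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the SCE interpreter): sums every d
   successive items of in[0..n-1]. The n/d block sums are written to out,
   and the number of sums produced is returned. */
size_t accumulate_blocks(const double *in, size_t n, size_t d, double *out)
{
    assert(in != NULL && out != NULL && d > 0);
    size_t blocks = n / d;              /* one result per d input items */
    for (size_t b = 0; b < blocks; b++) {
        double sum = 0.0;
        for (size_t j = 0; j < d; j++)
            sum += in[b * d + j];
        out[b] = sum;
    }
    return blocks;
}
```

With six input items and `d = 3`, the sketch yields two sums, which mirrors the rate of one result every $d$ time units.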


10.  CONCLUSION.

This  dissertation  is  intended  to  contribute  to  the  area  of  computer 
architectures  in  two  different  ways,  namely  (1)  by  formalizing  the  concept  of 
systolic  computations,  and  (2)  by  providing  a  basis  for  the  design  of  a 
systolic/pipelined  system  for  finite  element  computations. 

The  mathematical  model  suggested  for  systolic  networks  provides  a 
method  for  a  clear  and  precise  specification  of  systolic  computations.  It 
also  results  in  a  formal  technique  for  the  verification  of  the  operation  of 
systolic  networks.  The  central  concepts  in  the  abstract  model  are  those  of 
data sequences and sequence operators. Although we only defined the few
operators needed in this work, it should be clear that further sequence
operators may be introduced to model other types of computational cells.

The  computer  equation  solver,  written  to  supplement  the  abstract  model, 
is  intended  to  be  used  in  particular  for  the  computational  assessment  of 
systolic  networks  in  those  cases  where  the  analytical  verification  is  difficult. 
The  application  of  this  solver  is  equivalent  to  the  simulation  of  the  operation 
of the given network for specific input data. The syntax-directed approach
used in the implementation of the solver/simulator led to a very modular
program, which simplifies the task of introducing new sequence operators in
the  future.  Actually,  the  addition  of  a  new  operator  (a  new  production  rule 
in  the  grammar)  requires  only  the  implementation  of  a  corresponding 
semantics  routine  that  describes  the  effect  of  the  operator. 

The potential of the abstract model extends beyond its application to
clocked systolic networks. In fact, the discussion in Section 5.5 is a first
step toward the application of the model to self-timed systems and the
uniform treatment of systolic networks irrespective of the method used for
synchronizing the operation of their cells.

Besides its value in demonstrating the power of the systolic model, the
design of the finite element system suggested in this dissertation may be
particularly useful. In addition to being adequate for VLSI implementations,
it has the advantage of being modular in the sense that if the system is
designed for a specific number k of nodes in each element and order q of
the quadrature formula, it can be easily modified to perform the analysis for
different values of k and q.

Moreover, by applying the idea of pipelining the computations for the
different elements, we obtained a design that is independent of the domain
of the problem and the number of elements in the grid. It should be
noted, however, that the LU factorization network used in the system based
on a direct solver does depend on the bandwidth of the stiffness matrix,
which in turn depends on the finite element grid and on the numbering
scheme used for labeling its elements. More research is needed in order
to obtain a system that is completely independent of the structure of the
finite element grid.

In Chapters 7 and 8, we presented an analytical verification for the
design of the different components in the finite element system. The design
was also checked using the solver/simulator of Chapter 4. However, due to
space limitations, we did not include the details of the simulation in this
dissertation.

Finally, we note that it is not easy to define a measure for estimating
the efficiency of systolic networks. An intuitive measure would be the
quotient $C/(TP)$, where $T$ is the time needed by a systolic network to
complete its computation, $P$ is the number of computational cells in the
network, and $C$ is the number of operations to be computed by the network.
This measure, however, does not take into consideration the type of
operations performed by a cell, which in our case range from simple memory
cells to floating point dividers. It also ignores the benefit obtained by
the regular movement of the data in the network. In [23] a more elaborate
measure was suggested that takes into account the bandwidth of the input and
output links of the network in comparing the efficiency of different
systolic networks.

Both measures estimate the utilization of the computational cells in a
network without differentiating between the different types of cells. This is
acceptable if all the cells in the network are of the same type. However, if
the network contains more than one type of cell, as is the case with our
system, we believe that the utilization of each cell should be multiplied by a
weight that reflects the hardware complexity of that cell. More
work is needed to develop an efficiency measure of this type.
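
As a rough illustration of the two ideas discussed above, the following C sketch computes the simple quotient and one possible weighted variant. Several points are assumptions rather than results of this dissertation: the quotient is read here as $C/(TP)$ (operations per cell per time unit), the function names are hypothetical, and the weighted form (a weighted mean of per-cell utilizations) is only one conceivable way to attach hardware-complexity weights, not a proposed measure.

```c
#include <assert.h>
#include <stddef.h>

/* Unweighted measure, read as C / (T * P): the fraction of the available
   cell-time slots that perform one of the C required operations. */
double utilization(double ops, double time_units, double cells)
{
    assert(time_units > 0.0 && cells > 0.0);
    return ops / (time_units * cells);
}

/* Assumed weighted variant: each cell's utilization util[i] is multiplied
   by a weight weight[i] standing for its hardware complexity, and the
   result is normalized by the total weight. */
double weighted_utilization(const double *util, const double *weight, size_t n)
{
    double num = 0.0, den = 0.0;
    assert(util != NULL && weight != NULL);
    for (size_t i = 0; i < n; i++) {
        num += weight[i] * util[i];
        den += weight[i];
    }
    assert(den > 0.0);
    return num / den;
}
```

In this sketch a fully busy simple cell and a half-busy complex cell do not average to 0.75: with weights 1 and 3 the weighted value is 0.625, because the idle time of the more complex cell counts for more.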


APPENDIX  A 


Properties  of  sequence  operators. 

In this appendix we list some properties of combinations of the
different operators defined in this dissertation. All the properties are
directly verifiable from the definitions of the operators and are very useful
in simplifying any manipulation of sequence expressions.

Most of the properties take the form 'sequence expression = sequence
expression'. However, some have the form 'sequence expression $\to$
sequence expression', where we formally define the implication operator $\to$
as follows:


IF for any $t$ either $\eta(t) = \xi(t)$ or $\eta(t) = \delta$ THEN $\xi \to \eta$


that is, $\eta$ is equal to $\xi$ after replacing some of its elements by
$\delta$. Consequently, if $\xi \to \eta$, then we may replace $\xi$ by $\eta$
in any sequence expression as long as $\delta$ is treated as a don't care and
not as a special symbol, that is, in the context of inert computations. Of
course, if $\xi \to \eta$ and $\eta \to \xi$ then $\xi = \eta$.
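
The definition of the implication operator can be checked mechanically on finite sequences. The C sketch below is a minimal illustration under assumptions of our own: a sequence element is modelled by the hypothetical type `item_t`, which pairs a value with a don't-care flag, and both sequences are assumed to have the same length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical element representation: a value plus a flag that marks
   the don't-care symbol. */
typedef struct {
    double value;
    bool   dont_care;
} item_t;

/* Returns true when xi -> eta holds on the first len elements, that is,
   every eta[t] is either a don't care or equal to xi[t]. */
bool implies(const item_t *xi, const item_t *eta, size_t len)
{
    assert(xi != NULL && eta != NULL);
    for (size_t t = 0; t < len; t++) {
        if (eta[t].dont_care)
            continue;                    /* element replaced by a don't care */
        if (xi[t].dont_care || xi[t].value != eta[t].value)
            return false;                /* eta carries data that xi lacks */
    }
    return true;
}
```

Calling `implies` in both directions and obtaining true twice corresponds to the closing remark of the definition: the two sequences are then equal.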


PI)  For  any  element-wise  operator  'op'  with  0  op'  0=6  we  have 

1.1)  For  r  =  n.  e,  E  or  P 


r<€)  'op'  Tty)  =  r<E  'op'  y) 


1.2)  M*1 . wnttv 


,  .  ,  .  ,  .w  1 . wn  . 

.lp)  'op'  Mr  (7)r 


M 


w  1 . wn 


■r  OP'  7)  -|]. 

1.3)  As  a  direct  result  of  PI. 2  we  have 


V  = 

OP'  yni) 


Z  op' 


w  1  ....wn 


(7? 


V 


w  1  wn 

vn  s  Mr  (I{  op  'op'  7?nJ) 


1.4)  if.  in  addition,  op'  is  a  0-reguiar  operator  then 


n'  i  'op'  7?  =  n'  c 


where  7(C)  =  minfT(7?)-r,7({)i  and  £(f)  =  Ht)  'op'  7j( ttr ) 

P2)  For  the  scalar  multiplication  operator  \\  It  follows  that 


2.1)  For  r  =  n.  0,  E  or  P 


w  .  T({)  =  TCw 
.w  l....wn 


2.2)  w  .  M”  . W"1  •  *.7)n)  =  Af*  l,...w/j ^  ^  u-jj.  •••.Iw 


'nl) 


PS)  Composition  of  n  with  itself 


3.D  n  n  t  *  n  € 


.-i 


3.2)  n  n  i  =  { 


-i 


3.3)  n  n  t  =  € 


If  and  only  if  £(l)=0 


3.4)  {  -  n  n 


-l 


P4)  Composition  of  n  with  9 


e'  n»  £  .  n^”*  er  £ 


for  r>0  and  any  k 


PS)  Composition  of  O  with  M 


5.D  n*  m'1 . »n(£]. 


.  u"' . 


•«»>  ■  “in"  <n  *V--n«»> 


for  any  rand  s  >  -r 


5.2)  fi  Af^1'"'wn(4v 


V  =  Wr 


W  1....WO 


mn  ■  nc,  .••••  ^n-! 


cox  cx*  .^l.-wn,, 

5.3)  fl  M r 


.n‘{„) 


where  K  =  w i  +  ■ 


■tw. 


5.4)  M*1,- "•*"(£,. 


rtl 


,  x  «r  -r  ,.w  l,..,wn  ...  ,  . 

V  =  n  n  *rtl 


5.5)  1 


,  X  _  Mwl....wn 

V  -  «!  <*t 


T 


.nqn~qzr 


where  <?  =  Wj  +  < 


■  +w 


/-I 


P6)  Composition  of  n  with  E 

e* 


6  ,>  n“ « 


=  ns  ej  £ 


for  any  r  and  s  >  r 


6.2)  ej+1  *  =  nr  rfr 


<tl  « 


6.3)  £*  n* 


n*  eJ  i 


P7)  Composition  of  n  with  A 

7.1)  A,  k  s  n“  {  =  n"  Ar-“'k-s  t 

7.2)  Ar*yk-3  t  =  nr  nr  { 

P8>  Composition  of  fl  with  P 


for  u  <  r 


8.1) 

nr 

ii,™1^ 

0  =1  ,/T) 

e) 

8.2) 

=l.m<n''  ‘*> 

■*  cf  Pk 

0=1  ,/n 

<* 

9) 

// 

T(«®) 

<  k-r 

8.3) 

n“‘ 

_/c  ■ 

“  Pe=l.m(n 

-r 

€#) 

/f 

Tde) 

<  k 

8.4) 

* 

-l.»,<n'r  «*> 

- n''  ^=i. 

m 

(*e) 

/f 

ar  rf' 

- 

8.5) 

* 

<«>  -  n* 

P9)  Composition  of  e  with  itself 


e  e  t  -  e  e  *  ■  e 


kr  tk  +r 


P10)  Composition  of  6  with  E 

10.1)  05_1  i  -  6s'1  t 

10.2)  0s  _1  £*  f5* 

r+1  *  sr+1 


0s'1  < 


Pll)  Composition  of  0  with  A 

a'-"-*  e3-'  «  =  e\  e3”'  e 

PI  2)  Composition  of  @  with  P 

P5*  (0S_1  **)  =  05  _1  Pk  (f9) 

0=1  ,m(e  «  )  0  pe=i,m(«  J 

PI  31  Composition  of  M  with  itself 

.1....1 


13.1)  if  =  Mr'"'  (7?^- •  *  then 

1 


•v  * 


*'V 


13.2)  ,V  . 


1  1  ,7?1  ’’  *’  -vm  '*1  ’ -*t?  for  m+n  <k 

PI  4)  Composition  of  M  with  A 

r  Jr  C 


14.1) 

.r  .k  .3 

A 

...1 

=  Ar*-3 

14.2) 

jt  .k  .1 

A 

M1' 

r 

...1 

whore 

for  1  <r 


1 ' 


i  -u 


•V 


U=1 
P15)  Composition  of  M  with  P 


tW  1....W/7  ,rO 


e=l.m  1 


<£-j.  •  •  •  .£n))  for  bwlt*"+iwi 


PI  6)  Composition  of  E  with  P 

,6”  „<*?  <*> 

1621  <  «s>  * 

PI  7)  Composition  of  A  with  P 

*'*■'  c, ,«•>  -  «•» 

P18>  Other  properties  involving  the  multiplication  operator 

181)  ^r-H  *  *  *  "  •  ^rn"r  V  if  7 


for  r>0 


lo  r+1  «  *  -  •eitf’tl)  .  n  n  •  7?  /f  7(t?)  <  A+r 

18.2)  If  T(trXkn  for  r=0,  •  •  •  .n-1  then 

n'*<J  <t0.  •  •  ••{„_)>  *  pJw  * 

where  C  -  t?((/D  r)t1)  .  £  .  r=0.1.  •  •  •  ./>-!  and  □  Is  the 

r  n  r  p 

modulo  n  addition  operation  on  integers. 

P19)  If  77 1  j=1 .2 K  are  such  that  T  (7?J=n .  then 

£  -  rtA-l 

E  n  7?  .  =  n  t ? 

/=i  ' 

k 

where  7(7?)  =  r?-(A-l)  and  7?(f)  =  £  7 ?.(f+A-/). 

/  =  1  ' 

The  next  result  uses  the  ©  of  Section  2.1: 

P20J  Let  the  sequences  t\  ..  j=0.1 k  satisfy  =  n-i.  then 

v0  ©  n  t?1  ©  n2  7?2  ©  •  •  •  ©  n*  7^  =  y 


where  T  (y)  =  n  and 
f-1 

/ —  E  77  ,(f _/ ) 


v(t)  = 


E  77  (f  ~i ) 
/=o  ' 


f  =  1.2 . A 


f=A  +  1.A+2 . n 


Jj 


Next, we prove some lemmas that we used in the dissertation.
Lemma  1:  The  difference  equation 


°/+l  =  n  °i  +  */ 


has  the  solution 


/  =s  .a  +1.  •  •  •  .k  +1 


*  nc<r's)  nc<,-s)  <a.2: 

I 

Proof:  The  proof  uses  induction  on  r.  Evidently,  for  i=s  in  (a.l)  we  obtain 

o„  . ,  =  nc  oo  t  A 
5  +  1  S  5 

which  is  identical  to  (a. 2)  for  r=s+l.  Hence  assume  that  for  any  r=s+l.  .. 
.k.  ar  is  given  by  (a.2).  then  from  (a.l)  It  follows  that 


°r +1  "  n 


-  nc  a,  ,  c  4 


*  Ar 


1  ncfrt’-5>  as  t  £  nc<,tl*s>  V/«-.  t  Ar 

■  «c<rt,'s)  v,«-, 

1=5-1 

=  nc(f+1~s)  a  +  E  nc(/~s)  a  , 

5  r-/+s 

/  =5 

which  proves  that  or  +  1  is  also  given  by  (a.2). 

Lemma 2: Let $f_X$ be the ranking function for the set
$X = \{x_1, x_2, \ldots, x_n\}$, as defined in Section 3.3, then

$\min(\max(x_k, f_X(i-1,k-1)), f_X(i,k-1)) = f_X(i,k)$   (a.3)

Proof: Let $y_1, \ldots, y_{k-1}$ be the result of sorting
$x_1, \ldots, x_{k-1}$ in ascending order, and $z_1, \ldots, z_k$ the
corresponding result for $x_1, \ldots, x_k$. Hence,
$f_X(i-1,k-1) = y_{i-1}$, $f_X(i,k-1) = y_i$ and $f_X(i,k) = z_i$. Now
consider the following cases:

1) If $x_k \le y_{i-1} \le y_i$, then the left side of (a.3) is

$\min(\max(x_k, y_{i-1}), y_i) = \min(y_{i-1}, y_i) = y_{i-1}$

Since $z_1, \ldots, z_k$ are obtained from $y_1, \ldots, y_{k-1}$ by
inserting $x_k$ in some position before $y_{i-1}$, we immediately see that
$y_{i-1} = z_i$.

2) If $y_{i-1} \le x_k \le y_i$, then the left side of (a.3) is

$\min(\max(x_k, y_{i-1}), y_i) = x_k$

and in this case it is clear that $x_k = z_i$.

3) If $y_{i-1} \le y_i \le x_k$, then the left side of (a.3) is equal to
$y_i$, which in turn is equal to $z_i$ because, in this case, $x_k$ is
inserted in some position after $y_i$.
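
The recurrence (a.3) can be checked numerically. The C sketch below is an illustration only and rests on assumptions that are not stated in the lemma: the boundary values $f_X(0,k) = -\infty$ and $f_X(i,k) = +\infty$ for $i > k$ are chosen here so that the recurrence is defined at the ends of the range, and the function names `rank_recurrence` and `rank_direct` are hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

static double min2(double a, double b) { return a < b ? a : b; }
static double max2(double a, double b) { return a > b ? a : b; }

/* i-th smallest (1-based) of x[0..k-1], computed with recurrence (a.3).
   The two boundary values are assumptions made for this sketch. */
double rank_recurrence(const double *x, size_t i, size_t k)
{
    if (i == 0) return -INFINITY;       /* assumed: f(0, k) = -infinity */
    if (i > k)  return INFINITY;        /* assumed: f(i, k) = +infinity */
    return min2(max2(x[k - 1], rank_recurrence(x, i - 1, k - 1)),
                rank_recurrence(x, i, k - 1));
}

/* Reference value: i-th smallest of x[0..k-1] by insertion sort. */
double rank_direct(const double *x, size_t i, size_t k)
{
    double y[64];
    assert(x != NULL && i >= 1 && i <= k && k <= 64);
    for (size_t n = 0; n < k; n++) {
        size_t j = n;
        while (j > 0 && y[j - 1] > x[n]) {  /* shift larger items right */
            y[j] = y[j - 1];
            j--;
        }
        y[j] = x[n];
    }
    return y[i - 1];
}
```

The recursive form is exponential in `k` and is meant only for checking small cases against `rank_direct`, not as an efficient ranking procedure.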


Lemma  3  :  The  system  of  difference  equations 


7+1 

J2.  .,1.2, 

*  n  MSfiU 

.  +  p;D 

II 

O 

• 

• 

• 

(a.4) 

/  = 

n  + 

p/J  .  0 

O 

s* 

II 

(a.  5) 

with  the  conditions  prQ  =  1  and  s+rQ>3  has  the  solution 


n  u'J,  cx,  .  1) 

[n  MjVV m2x,_,  t  x,i  .  t) 

n  M]+/(In4x/-2  +  n2x/-1  +  V  ' 


/*r0tl 

/=ro+2-' 


Proof:  The  cases  f=rQ  and  r=rQt  1  are  easy  to  verify  by  direct  substitution. 
Hence,  we  will  consider  only  the  case  />Tq+2.  First,  we  will  prove  that  the 
solution  of  (a. 4)  is 


1  2  2  4  * 

p.  =  Mst/(tn  X/-!  +  n  X/-2J  •  0  } 


/=r0t2.' 


For  this,  we  use  (a.4)  twice  to  get 
pi  ~  n  wa+/_i(t  •  lX)--\  + 

"  °2  •  IX/-1  +  n2^+2/-2(t  •  W'W” 

By applying Properties P5.1, P5.2 and P1.3, and replacing the
sequences not relevant to the computation by $\delta$, we obtain

pi  =  °2  w]'2-iu  •  ♦  n%-2  ♦  n2P/_2]» 

=  n2  o'.  tx/_1+n2x/_2+n2p/_2J) 

■  m}3  +y 1  (in2x/ _1+n4x/ _2tn4P/ _2i.  o',  o')  (a.7) 


Now.  if  /=r0t 2.  then,  from  the  hypothesis.  P/ _2=Pf  0~ L '  and  ,hus  <a  7) 
reduces  to  (a. 6).  Else,  if  />rQ+2.  then  we  replace  P/_2  In  (a.7)  by  its 
value  from  (a.4).  This  gives 

pi  *  MJ'2(In2x/-l  +  n4x/-2  +  nXt/-3(l  *  X/-3^/-3,l  •  <>*> 

*  M5,2(tn2x/_i  ♦  n4x^. _2  +  m^‘2  (n6i.o')j  .  o') 

« 

where,  again,  we  replaced  [X^g+p^g]  by  0  .  Now,  Property  PI. 3  gives 

o,  ■  ♦  n6‘i  •  o'  •  o") 

*  M]'2/«n2x/_1*n4kj_2*neii  .  o') 
which  reduces  to  (a. 6)  for  s+i> 6.  that  is  s+rQ>3. 

Next,  we  substitute  (a. 6)  into  (a. 5)  and  apply  PI. 3  to  obtain 

•  n  "H*  I  •  ‘> 

=  n  ^  ^  (tx/+n2x/ _1+n4x/ _2j  .  i  ) 
which  is  the  required  formula  for  the  case  />rQ+2. 


APPENDIX  B 


The  grammar  for  the  SCE  language. 


Terminal  symbols 


PAR  INDEX  SEQN  FOR  DO  END  MAXTIME  INPUT  OUT

Comma  Semi  Colon  Equal  Plus  Minus  Mult  Div

Lbrak  Rbrak  Lcurl  Rcurl  Lpar  Rpar  Period

0  Z  T  A  E  M  U1  U2

Identifier  Positive_integer  Positive_real

Grammar  rules 


 1)  <prog>        ::=  <declare> <in_part> <body> <out_part>

 2)  <declare>     ::=  <par_decl> <index_decl> <seqn_decl>

/* PARAMETER DECLARATIONS */

 3)  <par_decl>    ::=  PAR <par_list> Semi
 4)                 |
 5)  <par_list>    ::=  <par_list> Comma <par_stmt>
 6)                 |   <par_stmt>
 7)  <par_stmt>    ::=  Identifier Equal Positive_integer

/* INDEX DECLARATIONS */

 8)  <index_decl>  ::=  INDEX <i_list> Semi
 9)                 |
10)  <i_list>      ::=  <i_list> Comma Identifier
11)                 |   Identifier

/* SEQUENCE DECLARATIONS */

12)  <seqn_decl>   ::=  SEQN <dim_list> Semi
13)  <dim_list>    ::=  <dim_list> Comma <seqn_dim>
14)                 |   <seqn_dim>
15)  <seqn_dim>    ::=  Identifier Lcurl <range_list> Rcurl
16)                 |   Identifier
17)  <range_list>  ::=  <range_list> Comma <range>
18)                 |   <range>
19)  <range>       ::=  <i_expr> Colon <i_expr>
/* INDEX EXPRESSIONS */

20)  <i_expr>         ::=  <i_expr> Plus <i_term>
21)                    |   <i_expr> Minus <i_term>
22)                    |   <i_term>
23)  <i_term>         ::=  <i_term> Mult <i_factor>
24)                    |   <i_factor>
25)  <i_factor>       ::=  <simple_factor>
26)                    |   Minus <simple_factor>
27)                    |   Lpar <i_expr> Rpar
28)  <simple_factor>  ::=  Identifier
29)                    |   Positive_integer


/* THE BODY OF THE PROGRAM */

30)  <body>          ::=  <stmt_list> Semi
31)  <stmt_list>     ::=  <stmt_list> Semi <stmt>
32)                   |   <stmt>
33)  <stmt>          ::=  <eqn>
34)                   |   <for_stmt>
35)  <for_stmt>      ::=  FOR <for_spec> <stmt_list> END
36)  <for_spec>      ::=  Identifier Equal <i_expr> Comma <i_expr> DO
37)  <eqn>           ::=  <seq_spec> Equal <seq_expr>


/* SEQUENCE SPECIFICATIONS */

38)  <seq_spec>      ::=  Identifier Lcurl <indicat_list> Rcurl
39)                   |   Identifier
40)  <indicat_list>  ::=  <indicat_list> Comma <i_expr>
41)                   |   <i_expr>


/* ELEMENT WISE OPERATORS ON SEQUENCES */

42)  <seq_expr>  ::=  <seq_expr> Plus <seq_term>
43)               |   <seq_expr> Minus <seq_term>
44)               |   <seq_term>
45)  <seq_term>  ::=  <seq_term> Mult <s_factor>
46)               |   <seq_term> Div <s_factor>
47)               |   <seq_term> U1 <s_factor>
48)               |   <seq_term> U2 <s_factor>
49)               |   <s_factor>
50)  <s_factor>  ::=  Positive_real Period <seq_factor>
51)               |   <seq_factor>
/* INPUT SPECIFICATIONS */

68)  <in_part>   ::=  INPUT Lpar <inp_list> Rpar Semi
69)  <inp_list>  ::=  <inp_list> Comma <inp_spec>
70)               |   MAXT Positive_integer
71)  <inp_spec>  ::=  <seq_spec>
72)               |   FOR <io_for> <inp_spec>
73)  <io_for>    ::=  Identifier Equal <i_expr> Comma <i_expr>


/* OUTPUT SPECIFICATIONS */

74)  <out_part>  ::=  OUT Lpar <out_list> Rpar Semi
75)  <out_list>  ::=  <out_list> Comma <out_spec>
76)               |   <out_spec>
77)  <out_spec>  ::=  <seq_spec>
78)               |   FOR <io_for> <out_spec>
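
To make the index-expression rules (20 to 29) of the grammar concrete, the following C sketch evaluates such expressions directly. It is not the method used by the SCE interpreter, which executes tuples produced by a separate table-driven parser; here a hand-written recursive-descent evaluator is swapped in for illustration, with the left-recursive rules turned into loops. The surface characters `+`, `-`, `*`, `(` and `)` for the terminals Plus, Minus, Mult, Lpar and Rpar, the restriction of identifiers to single lowercase letters, and all function and type names are assumptions of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical parser state: a cursor into the text, one value per
   single-letter identifier 'a'..'z', and an error flag. */
typedef struct { const char *p; const int *vars; int ok; } parser_t;

static int parse_expr(parser_t *ps);

static void skip_ws(parser_t *ps)
{
    while (isspace((unsigned char)*ps->p)) ps->p++;
}

/* <simple_factor> ::= Identifier | Positive_integer */
static int parse_simple(parser_t *ps)
{
    skip_ws(ps);
    if (isdigit((unsigned char)*ps->p)) {
        int v = 0;
        while (isdigit((unsigned char)*ps->p)) v = v * 10 + (*ps->p++ - '0');
        return v;
    }
    if (islower((unsigned char)*ps->p)) return ps->vars[*ps->p++ - 'a'];
    ps->ok = 0;                          /* neither identifier nor integer */
    return 0;
}

/* <i_factor> ::= <simple_factor> | Minus <simple_factor> | Lpar <i_expr> Rpar */
static int parse_factor(parser_t *ps)
{
    skip_ws(ps);
    if (*ps->p == '-') { ps->p++; return -parse_simple(ps); }
    if (*ps->p == '(') {
        ps->p++;
        int v = parse_expr(ps);
        skip_ws(ps);
        if (*ps->p == ')') ps->p++; else ps->ok = 0;
        return v;
    }
    return parse_simple(ps);
}

/* <i_term> ::= <i_term> Mult <i_factor> | <i_factor> */
static int parse_term(parser_t *ps)
{
    int v = parse_factor(ps);
    for (skip_ws(ps); *ps->p == '*'; skip_ws(ps)) {
        ps->p++;
        v *= parse_factor(ps);
    }
    return v;
}

/* <i_expr> ::= <i_expr> Plus <i_term> | <i_expr> Minus <i_term> | <i_term> */
static int parse_expr(parser_t *ps)
{
    int v = parse_term(ps);
    for (skip_ws(ps); *ps->p == '+' || *ps->p == '-'; skip_ws(ps)) {
        char op = *ps->p++;
        int t = parse_term(ps);
        v = (op == '+') ? v + t : v - t;
    }
    return v;
}

/* Evaluates text; returns 1 and stores the value in *out on success,
   returns 0 if the text is not a complete index expression. */
int eval_index_expr(const char *text, const int vars[26], int *out)
{
    assert(text != NULL && vars != NULL && out != NULL);
    parser_t ps = { text, vars, 1 };
    int v = parse_expr(&ps);
    skip_ws(&ps);
    if (!ps.ok || *ps.p != '\0') return 0;
    *out = v;
    return 1;
}
```

As in the grammar, a unary minus applies only to an identifier or an integer, so an input such as `-(k)` is rejected by this sketch while `-k + 10` is accepted.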
APPENDIX  C 


The  listing  of  the  SCE  Interpreter  program 


/** GLOBAL DECLARATIONS **/

#include <stdio.h>

#define Prog_length   650
#define N_rules       78    /* number of rules in the grammar */
#define N_dums        24    /* rules not requiring any action */
#define N_symbol      20    /* size of the symbol table */
#define N_bound       20    /* size of the bound table */
#define N_seqn        129   /* max. number of sequences used */
#define Maxtime       30    /* upper limit on simulation time */
#define stack_length  40    /* length of working stack */
#define N_words       (((Maxtime + 1) / 16) + 1)
#define N_ratios      5     /* max. # of arguments in M operators */
#define N_real        5     /* max. # of real constants used */

int overflow[8] = { Prog_length, stack_length, N_symbol, N_bound,
                    N_seqn, Maxtime + 1, N_ratios, N_real };

FILE *fopen(), *fp;

/* an array to store the program triples */
struct { char action;
         int  value;
         int  top; } prog[Prog_length];

int location;               /* pointer to prog array */

/* adjust[i] contains the length of the R.H.S. of grammar rule
   i minus one, which is the adjustment in the top of the stack */
int adjust[N_rules] = { -3, -2, -2,  1, -2,  0, -2, -2,  1, -2,
                         0, -2, -2,  0, -3,  0, -2,  0, -2, -2,
                        -2,  0, -2,  0,  0, -1, -2,  0,  0, -1,
                        -2,  0,  0,  0, -3, -5, -2, -3,  0, -2,
                         0, -2, -2,  0, -2, -2, -2, -2,  0, -2,
                         0,  0, -1, -3, -8, -2, -5, -1, -1, -1,
                        -3, -2,  1, -2,  0, -2,  0, -4, -2, -1,
                         0, -2, -4, -4, -2,  0,  0, -2
                      };

/* dums[] contains the numbers of the grammar rules not requiring
   any action */
int dums[N_dums] = {  3,  4,  5,  6,  8,  9, 12, 13, 14, 22,
                     24, 25, 29, 31, 32, 33, 34, 44, 49, 51,
                     69, 74, 75, 76
                   };

int   stack[stack_length];  /* dynamic working stack */
float fstack[stack_length]; /* a matching value-stack */

/* SYMBOL TABLE: type = S, P or I for seqn, parameter or index, respectively. */
struct { char type;
         int  entry1;
         int  entry2; } sym_tab[N_symbol];

int lbound[N_bound];        /* lower bound table */
int ubound[N_bound];        /* upper bound table */
int ran_ptr = -1;           /* pointer to bound table */
/* SEQUENCE STORAGE */

float    seq_store[N_seqn][Maxtime + 1]; /* sequences storage */
unsigned d_table[N_seqn][N_words];       /* keeps track of don't cares */
int      seq_ptr = 0;                    /* pointer to sequence store */
int      seq_size;                       /* actual size of the table */
int      last_computed[N_seqn];          /* for consistency checks */
float    r_store[N_real];                /* storage for real constants */
int      r_ptr = -1;                     /* the corresponding pointer */
int      ratio_temp[N_ratios];           /* to store multiplexer ratios */
int      ratio_ptr;                      /* a pointer to ratio_temp */
int      not_done;                       /* a loop control variable */


/ttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttf 
ft  THE  MAIN  PROGRAM  */ 

*ain< ) 

< 

char  action  * 

int  value  * 

int  top  ■»  -1  i  ft  current  top  of  stack  */ 

int  j *  looking  * 
int  stage  * 

int  end-stageC53  * 

end-staSeCl332  *  end_staaeC2T*68  * 
end_staseC33330  I  end_sta3eC4]»l  5 

ft  open  the  file  that  contains  the  prograe  tuples  */ 
fp  *  fopen( ’out. parse* *  *r*>  ) 
for(seo_ptr»0  5  sea-Ptr  <  N.sean  *  sea_ptr++) 
for(jaO  J  J  <  N-uords  *  J++) 
d_tableCsea_ptr3CJ3  »  0  t 
sea-Ptr  -  0  I 

ft  Build  standard  entries  in  syebol  table*  they  can  be 
overwritten  by  proper  declaration  tf 
for(J«0  *  J  <  3  *  j++) 

■C  sye_tabCJ3.type  *  'S'  * 

syi»_tabCj3. entry 1  3  sea-Ptr  ? 
last-coeputedCsea_Ptr++3  3  Maxtiee  * 

sy»_tabCJ3.entry2  3  -1  »  ft  indicating  a  single  seauence  */ 

> 

for(J33  J  j  <  N.syebols  »  j++) 
sya.tabCJ3.type  *  '  '  * 
for(J=l  J  j  <  Maxtiee  *  J++) 

•C  sea-storeC13CJ3  *  0.0  * 
sect_storeC23CJ3  3  1.0  J  > 
for(J=0  *  j  <  N-uords  >  J++) 
d_tableC03CJ3  3  0177777  I 

for(ja3  *  J  <  N_sean  I  j++) 
last.coeputedC J3  3  0  $ 

ft  END  OF  INITIALIZATION  AND  BEGINNIG  OF  MAIN  LOOP  tf 


ft  Outer  for  loop!  stase3l  ->  declarations* 
stage*2  ->  input  part* 
stage»3  ->  prograe  body* 
stage3*  ->  output  part,  tf 
for(stage=l  *  stage  <=  4  *  Hstase) 
i. 

location  3  0  * 

not_dcne  3  i  * 

ft  inner  loop  l:  read  the  tuples  for  the  corresponding  stage 
from  file*  keep  track  of  the  top  of  the  stack  and  save  in 
progC3  only  those  triples  that  reouire  a  certain  action  tf 


do 

< 

fscanff  fp  >  'Xc'  *  laction  )  5 
if (action  ! 3  'L')  fscanf(  fp  *  'XdNn’  *  lvalue  ) 
else  <  check<7*  ++r_ptr)  * 

fscanf(  fp  *  ’Zf\n'  *  4r_storeCr_rtr3  ) 
switch  (action) 


> 


/%  REDUCE  ti 


< 

case  'R' 


ease  'C‘ 


l  <  lookins  *  1  )  j  *  0  * 
vhile( lookins  tt  J  <  N_du»s) 

if (value  33  du»sCj++!)  look ins  -  0  * 
if(lookinS)  push_triple(8Ction*  value»  top)  * 
chcck( 1»  (top  +*  3dJustCvalue-l!)  >* 
if (value  33  end-staseCstase!)  not_done=0  I 
break  ’ 


> 

1  {  check* 1*  ++top> 
push_triple( action*  value*  top  )* 
break 


it  SHIFT  IDENTIFIER  %/ 
it  OR  INTEGER  CONSTANT  */ 


y 


ease  L'  1  <  check <l»++toe>  i  it  SHIFT  REAL  CONSTANT  %/ 

push_triPle(action*  r-ptr*  top)  * 
break  I 

> 

case  S'  J  <  check (l*++top)  i  it  SHIFT  %/ 

break  * 

> 

case  'A'  :  <  check* 1*  Htop)  )  /*  ACCEPT  */ 

push_triple(action*  value*  top  )* 
break  * 

% 

> 

> 

while (not-done)  I 


location  *  -1  * 
not-done  3  1  * 

it  inner  loop  2t  execute  the  action  routines  for  that  stase  ti 
while (not-done) 

< 

Hlocation  * 

value  3  prosC location!. value  * 
top  3  prosC location!. top  ) 
suitch(  proSClocation!. action  ) 

■C 

case  ’C’  .*  <  stackCtop]  3  value  *  break  *  > 

case  'L'  t  <  fstackCtop]  3  r.storeCvalue!  *  break  *  > 

case  'R'  :  <  semantic* value*  top)  *  break  *  > 

case  'A'  5  not-done  *  0  * 

> 

> 

> 

felose(fp)  > 

> 

it  END  OF  THE  MAIN  PROGRAM  ti 


ittttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttti 

it 

A  routine  to  store  a  triple  in  the  arras  prosC! 
ti 

push_triple<3  *  v*  t)  char  a  *  int  v*t  * 

•C 

prose location! .action  3  a  * 

-  prose location!. value  3  v  * 
proSClocationl.toP  3  t  * 
check(0*  Hlocation)  * 
return**))  * 

> 


/mm*m*mmmmmmmmmmmmmm*m**m*sxs*******  2 1 0 

/*  THE  SEMANTICS  ROUTINES 

/mxmmmmxxxxxxmxxmxxxxxxmxxxxxxxxxmxmx¥*«m*x*xxxtxxx» 


int  declarina  *  1 
int  ski?  *  0 
int  MskiP3  0 
float  setf loat ( ) 
float  tfloat 
int  TIME  3  1 
int  d-flas 


/%  *1  only  durins  declaration  */ 

/%  to  skip  calculation  in  case  of  don't  cares  */ 
/%  to  chose  the  araueent  in  M  operator  */ 

/*  teeporara  variable  */ 

/X  dlobal  sastee  tiee  */ 


seeentic(rula»  top)  ini  ruler  top  > 

r 

\ 

int  t0>  t2»  readina  »  J  * 

if (Mskip) 

<  switch (rule) 

<  case  54  :  break  • 

C3se  64  J  —stack!  top-43  i 

case  65  1  — Mskip  f 

default  .*  retum(O)  i 

> 


if (skip) 

■C  suitch(rule) 
-C  case  53  i 
case  54  J 
case  57  t 
case  58  1 
case  5?  : 
case  60  5 
case  61  2 


<  —skip  »  return (0)  j  > 


<  skip++  )  return(O)  i  > 


default  :  return(O)  5 


suitch(rule) 

case  1  1  -C 
case  2  2  < 


•C  not.done  *  0  »  return<0)  >  > 
<  declarina=  0  * 


not.done  a  0  i 
sea-size  3  sea-ptr 

PARAMETER  DECLARATION 


l  i  return(O)  » 


/*  sianal  end  of  staae  4  */ 

/*  sianal  end  of  declarations  */ 
/t  that  is  staae  1  t/ 

i  > 


case  7 


:  {  t2  3  stackCtop-23  *  check(2»t2)  ! 
if((t2  >  2)  It  ( sae. tab Ct23. tape  !3 
saa_tabCt23.entrsl  3  stackCtop]  f 
sae.tabCt23. tape  3  '?'  i 

return(O)  )  > 

INDEX  DECLARATIONS  */ 


case  102 

case  11 :  {  to  3  stackCtop 3  J  check(2»t0)  ) 

if ((t0  >  2)  St  (sae.tabCt03.tape  !3 
sae-tabCt03.entral  3  0  • 

sae_tabCt03.entra2  3  0  i 

sae.tabCt03.tape  3  'I'  * 

return(O)  »  } 

t  SEQUENCE  DECLARATIONS  */ 

case  152  <  check(3>  ++ran-Ptr)  ) 
lboundCran.ptr3  3  0  > 
uboundCran_ptr3  3  0  > 
sea.ptr  3  sea_ptr  +  stack! top-1 3  J 
check(4*  sea.ptr)  ) 
return(0)  >  > 

case  16*.  <  tO  3  stackCtop 3  > 

if ((t0  >  2)  tt  <sae_tabCt03.tape  !» 
sae.tabCt03.tape  3  'S'  i 

sae.tabCt03.entral  3  sea.ptr  i 

sae.tabCt03.entra2  =  -1  J 

check(4»++sea_ptr)  ! 

return(O)  »  > 


))  run_error(15)  » 


'))  run.error(15) 


))  run.error(15)  » 


oast  1- 


*  stack! top 3  l  stack C  top-23  > 


stack.!  top-23 
return! 0)  J  > 
cast  18:  <  t2  *  stack.Ctop-23  >  check(2»t2>  * 
if((t2  >2)  IS  ( sum. tabCt23 .type  !■ 
sy*_tab!t23. entry!  *  sea.Ptr 
sym_tab!t23.entry2  =  ran_Ptr 
sym_tab!t23 • tape  *  ’S’ 
rtturn(O)  !  > 

cast  19:  <  ehtck(3»++ran_ptr>  *  ,  . 

abound! ran_Ptr 3  *  stackCtop 3 
lboundCran_Ptr3  *  stackCtop-23 


'))  run.tr ror( 13)  » 


stack C tot-23 
rtturn<0)  •  > 


stackCtop!  -  stackCtop-23  +  1  * 


INDEX  EXPRESSION 


*/ 


cast  20 : 
oast  21 : 
cast  23: 
cast  26: 
case  271 
case  23 : 


stackCtoP-23  »  stackCtop-23  +  stackCtoe]  » 
rtturn(O)  1  > 

stackCtop-23  *  stackCtop-23  -  stack!top3  » 
return(O)  *  > 

stack! toe-23  *«  stack! toe 3  * 
rtturn(O)  t  > 

staekCtoP-13  *  -  stack!top3» 
return(O)  J  > 

stack! toe-23*  stackCtop-13  > 
return! 0)  J  > 

to  «  stackCtop 3  i  check (2» to )  » 

if (dtclarint  tl  ! sum. tab! t03. type  !*  'P'))  run_trror(4)  » 
if(!sym_tabCt03.typt  **  'P')  II  ! sum. tab! t03. type  33  'I')) 
<  stackCtop 3  »  sym_tabCt03.entryl  1  return<0)»  > 
run_error<3)  I  > 


/* 


PROGRAM'S  BODY 


X/ 


case  30:  <  if C++TIHE  <*  stack!13)  <  location  »  -1  i  return(O)  i  > 
printfCNn  XXXX  OUTPUT  SEQUENCES  «**\n*)  i 
not.done  *01  IX  sidnal  end  of  stase  3 

return<0)  J  > 

ease  33:  <  to  3  stackItop-23  i 

if (sy*_t3bCt03.entryl  >=  sym_tab!t03 .entru2)  return(O) 
++sy»_tabCt03.entryl  i 
location  *  stack! tOP-33  } 
return! 0)  i  > 

case  36:  <  t2  3  stackCtop-53  i  check(2»t2)  i 

if (3sm.tabCt23.tupe  !=  'I')  run_error(6) 


*/ 


sym.tabCt23«entryl  *  stack! top-33 
sy*_tabCt23.entry2  3  stackItoP-13 
stack! top-63  *  location  f 
return(O)  »  > 

case  37:  <  t2  3  stackCtop-23  i  check! 4»t2)  5 
if (last_comPutedCt23++  !=  TIME-1) 
if (stack! top!)  <  write.d(t2fT IME) 
S9Q.store!t23CTIME3  3  fstack!top3 
return(O)  *  > 


/X  initial  value  */ 
/X  final  value  X/ 


run.error(ll) 
f  return(O)  i 


/X 


SEQUENCE  SPECIFICATION 


X/ 


case  3S:  <  t2  3  stack! top-33  5  tO  =  stack! top-1 3  i 

if  (lboundCHran.pt  r  3  II  uboundCran_ptr])  run_error!3) 5 
sea.Ptr  3  sya_tabCt23. entry 1  +  to  > 
stack! tOP-33  3  seo-Ptr  ? 
return(O)  »  > 

case  39:  <  to  3  stackCtop]  » 

if ( sum. tab! tO ]. type  !®  'S')  run_error(13)  i 
if(sy».tabCtO].entry2  I-  -1)  run.error! 14) i 
stack!top3  3  sym.tabCt03.entryl  1 
return(O)  •  > 

case  40:  <  to  =  stack!top3  i 
ran_ptr++  » 

ifltO  <  lbound!ran_Ptr3  1 1  tO  >  uboundlran.Ptr])  run.error(l)  » 
stack.Ctop-23  *  stackCtop-23  *  (uboundCran.Ptrl  - 

lboundCran.ptr!  +  1)  +  (stack!top3  -  lboundCran.ptr!)  > 
return (0)  I  > 

case  41 :  <  t2  *  stackCtop-23  f  tO  *  stack! top 3  » 
check(2»t2>  1 

if (sum.tabCt23.type  !*'S'  II  su»_tabCt23.entry2  <  0)  run.error(12) 
ran_ptr  *  sy*.tabCt23.entru2  ) 

i f < tO  <  lboundCran.ptr]  1 1  tO  >  ubound!ran_Ptr3)  run_error(l)» 
stack! top 3  3  stackCtop]  -  lboundCran.ptr 3  5 
return! 0)  5  > 
/*  ELEMENT  WISE  OPERATORS  ON  SEQUENCES  */ 

/* stack[] holds the don't care flags, fstack[] the values */

case 42: { if (stack[top] || stack[top-2])
              { stack[top-2] = 1 ; return(0) ; }
           fstack[top-2] += fstack[top] ;
           stack[top-2] = 0 ;
           return(0) ; }

case 42: { if (stack[top] || stack[top-2])   /* subtraction; label repeated as printed */
              { stack[top-2] = 1 ; return(0) ; }
           fstack[top-2] -= fstack[top] ;
           stack[top-2] = 0 ;
           return(0) ; }

case 45: { if (stack[top] || stack[top-2])
              { stack[top-2] = 1 ; return(0) ; }
           fstack[top-2] *= fstack[top] ;
           stack[top-2] = 0 ;
           return(0) ; }

case 46: { if (stack[top] || stack[top-2])
              { stack[top-2] = 1 ; return(0) ; }
           if (fstack[top]) { fstack[top-2] /= fstack[top] ;
                              stack[top-2] = 0 ; }
           else stack[top-2] = 1 ;   /* division by zero gives a don't care */
           return(0) ; }

case 47: { stack[top-2] = u_op1(fstack[top-2], stack[top-2], fstack[top],
                                stack[top], &tfloat) ;
           fstack[top-2] = tfloat ;
           return(0) ; }

case 48: { stack[top-2] = u_op2(fstack[top-2], stack[top-2], fstack[top],
                                stack[top], &tfloat) ;
           fstack[top-2] = tfloat ;
           return(0) ; }

case 50: { if (stack[top]) { stack[top-2] = 1 ; return(0) ; }
           fstack[top-2] = fstack[top] ;
           stack[top-2] = 0 ;
           return(0) ; }

/*  OPERATORS  DEFINED  DIRECTLY  ON  SEQUENCES  */ 

case 52: { t0 = stack[top] ; check(4, t0) ;
           j = last_computed[t0] ;
           if (TIME > j) run_error(10) ;
           if (read_d(t0, TIME)) { stack[top] = 1 ; return(0) ; }
           fstack[top] = seq_store[t0][TIME] ;
           stack[top] = 0 ;
           return(0) ; }

case 53: { TIME = stack[top-1] ;
           fstack[top-1] = fstack[top] ;
           stack[top-1] = stack[top] ;
           return(0) ; }

case 54: { if (stack[top-3]) run_error(7) ;
           Mskip = 0 ;
           stack[top-3] = stack[top-1] ;
           fstack[top-3] = fstack[top-1] ;
           return(0) ; }

case 55: { t2 = stack[top-4] ;
           if (TIME < t2) { stack[top-8] = 1 ;
                            fstack[top-8] = 0 ;
                            return(0) ; }
           t2 = (TIME - t2) % (stack[top-2] * stack[top-4]) ;
           t0 = stack[top] ;
           j = last_computed[t0] ;
           if (TIME > j) run_error(10) ;
           tfloat = 0.0 ;
           for (j = (TIME - t2) ; j <= TIME ; j = j + stack[top-2])
              if (read_d(t0, j)) { stack[top-8] = 1 ;
                                   fstack[top-8] = 0 ;
                                   return(0) ; }
              else tfloat += seq_store[t0][j] ;
           stack[top-8] = 0 ;
           fstack[top-8] = tfloat ;
           return(0) ; }

case 56: { fstack[top-2] = fstack[top-1] ;
           stack[top-2] = stack[top-1] ;
           return(0) ; }
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C3se  571  <  t2  *  stack!top-33  »  tO  3  stack!top-13  i 
           if (TIME < t2) { skip = 1 ;
                            stack[top-5] = 1 ;
                            fstack[top-5] = 0 ; return(0) ; }
           t2 = (TIME - t2) % t0 ;
           for (j = 0 ; t2 >= ratio_temp[j] ; j++) ;
           stack[top-5] = ratio_ptr ;   /* cardinality of list */
           Mskip = j ;                  /* chosen element */
           return(0) ; }

case 58: { t0 = stack[top] ;
           if (TIME <= t0) { skip = 1 ;
                             stack[top-1] = 1 ;
                             fstack[top-1] = 0 ; return(0) ; }
           stack[top-1] = TIME ;
           TIME = TIME - t0 ;
           return(0) ; }

case 59: { t0 = stack[top] ;
           if (TIME <= t0) { skip = 1 ;
                             stack[top-1] = 0 ;
                             fstack[top-1] = 0 ; return(0) ; }
           stack[top-1] = TIME ;
           TIME = TIME - t0 ;
           return(0) ; }

case 60: { t0 = stack[top] ;
           t2 = (TIME + t0) % (1 + t0) ;
           if (t2) { skip = 1 ;   /* don't care */
                     stack[top-1] = 1 ;
                     fstack[top-1] = 0 ; return(0) ; }
           t2 = (TIME + t0) / (1 + t0) ;
           stack[top-1] = TIME ;
           TIME = t2 ;
           return(0) ; }

case 61: { t0 = stack[top-1] ; t2 = stack[top-3] ;
           if (TIME < t2) { skip = 1 ;
                            stack[top-5] = 1 ;
                            fstack[top-5] = 0 ; return(0) ; }
           stack[top-5] = TIME ;
           TIME = TIME - ((TIME - t2) % t0) ;
           return(0) ; }

case 62: { stack[top-2] = stack[top-1] ; return(0) ; }

case 63: { stack[top+1] = 1 ; return(0) ; }

case 64: { --stack[top-4] ;
           stack[top-2] = stack[top] ;
           fstack[top-2] = fstack[top] ;
         }

case 65: { --Mskip ;
           return(0) ; }

case 66: { check(6, ++ratio_ptr) ;
           stack[top-2] += stack[top] ;
           ratio_temp[ratio_ptr] = stack[top-2] ;
           return(0) ; }

case 67: { ratio_ptr = 0 ;
           ratio_temp[0] = stack[top] ;
           return(0) ; }

/* INPUT SPECIFICATIONS */

case 68: { not_done = 0 ;   /* end of stage 2 */
           return(0) ; }

case 70: { stack[1] = stack[top] ;
           check(5, stack[top]) ;
           return(0) ; }

case 71: { t0 = stack[top] ; reading = 1 ;
           for (j = 1 ; j <= stack[1] ; j++)
              if (reading)
                 { seq_store[t0][j] = getfloat(d_flag) ;
                   if (d_flag == 1) write_d(t0, j) ;
                   if (d_flag == -1) { reading = 0 ;
                                       write_d(t0, j) ; }
                 }
              else write_d(t0, j) ;
           last_computed[t0] = stack[1] ;
           return(0) ; }

case 72: { t0 = stack[top-1] ;
           if (sym_tab[t0].entry1 >= sym_tab[t0].entry2) return(0) ;
           ++sym_tab[t0].entry1 ;
           location = stack[top-2] ;
           return(0) ; }


case 73: { t2 = stack[top-4] ; check(2, t2) ;
           if (sym_tab[t2].type != 'I') run_error(6) ;
           sym_tab[t2].entry1 = stack[top-2] ;
           sym_tab[t2].entry2 = stack[top] ;
           stack[top-5] = location ;
           return(0) ; }

/*  OUTPUT  SPECIFICATIONS  */ 

case 77: { t0 = stack[top] ;
           for (j = 1 ; j <= stack[1] ; j++)
              if (read_d(t0, j)) printf(" d ") ;
              else printf("%5.2f ", seq_store[t0][j]) ;
           printf("\n =========\n") ;
           return(0) ; }

case 78: { t0 = stack[top-1] ;
           if (sym_tab[t0].entry1 >= sym_tab[t0].entry2) return(0) ;
           ++sym_tab[t0].entry1 ;
           location = stack[top-2] ;
           return(0) ; }

}

}

/*  END  OF  THE  SEMANTICS  ROUTINES  */ 


/* User Defined Operators */

/*

These routines are provided by the user to define the
binary operators U1 and U2. The operands are passed in
o1 and o2 and the result is returned in r.

If any of the operands is the don't care symbol, t1 or t2
is set to 1, correspondingly; otherwise they are set to 0.
The return value of the functions should be 0 if the
result of the operation is not a don't care and 1 if it is.

*/
u_op1(o1, t1, o2, t2, r) int t1, t2 ;
float o1, o2, *r ;
{

/* Formulas for u_op1 and r */

}

u_op2(o1, t1, o2, t2, r) int t1, t2 ;
float o1, o2, *r ;
{

/* Formulas for u_op2 and r */

}
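
As an illustration of the calling convention described above, the following is a minimal sketch of one possible user-defined operator. It is written with an ANSI prototype rather than the K&R form of the listing, and the choice of operation (element-wise maximum), the name u_op1_max, and the value stored in r for a don't care result are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical U1: element-wise maximum of two sequence items.
   o1, o2 : operand values
   t1, t2 : 1 if the corresponding operand is a don't care, else 0
   r      : receives the result
   Returns 1 if the result is a don't care and 0 otherwise. */
int u_op1_max(float o1, int t1, float o2, int t2, float *r)
{
    assert(r != NULL);
    if (t1 || t2) {          /* a don't care operand gives a don't care result */
        *r = 0.0f;           /* value is irrelevant; 0 is an arbitrary choice */
        return 1;
    }
    *r = (o1 > o2) ? o1 : o2;
    return 0;
}
```

The interpreter stores the return value as the don't care flag of the result and copies the value written through r, as cases 47 and 48 of the semantics routines do.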


/*******************************************************/
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/txxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ 

/*

The following routine reads the next item from the
input data file. It assumes that one of the following
exists on the file:

1) a floating point number,

2) a 'd', indicating a don't care item,

3) or '...', indicating that the remaining items in
   the current sequence are don't cares.

The different cases are signaled by the global flag d_flag
*/

float getfloat()

{

int c, int_part ;
int minus = 1 ;
float fraction, p_of_10 ;

/* skip blanks and newlines */
while ((c = getchar()) == ' ' || c == '\n') ;

if (c == EOF) run_error(9) ;

if (c == 'd') { d_flag = 1 ;    /* a don't care symbol */
                return(0) ; }

if (c == '.' && (getchar() == '.') && (getchar() == '.'))
              { d_flag = -1 ;   /* sequence terminated */
                return(0) ; }

if (c == '-') { c = getchar() ; minus = -1 ; }

int_part = 0 ;

while (isdigit(c)) { int_part = (10 * int_part) + (c - '0') ;
                     c = getchar() ; }
if (c != '.') run_error(8) ;
fraction = 0.0 ;
p_of_10 = 10.0 ;

while (isdigit(c = getchar())) { fraction = fraction + (c - '0') / p_of_10 ;
                                 p_of_10 = p_of_10 * 10.0 ; }

if ((c != ' ') && (c != '\n')) run_error(8) ;

fraction = minus * (int_part + fraction) ;
d_flag = 0 ;
return(fraction) ;

}


isdigit(d) int d ;

{

if (d <= '9' && d >= '0') return(1) ;
return(0) ;

}
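
The three kinds of input item accepted by getfloat can be summarized by a small classifier. The sketch below is not the routine of the listing: it works on an already separated token, relies on the standard strtof (so, unlike getfloat, it also accepts numbers without a decimal point), and the name parse_item and the error code 2 are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Classify one input token, mirroring the d_flag convention:
     0  a floating point number (stored in *value)
     1  'd', a don't care item
    -1  '...', the rest of the sequence is don't cares
     2  format error (hypothetical code; the listing calls run_error(8)) */
int parse_item(const char *tok, float *value)
{
    char *end;

    assert(tok != NULL && value != NULL);
    *value = 0.0f;
    if (strcmp(tok, "d") == 0)   return 1;
    if (strcmp(tok, "...") == 0) return -1;
    *value = strtof(tok, &end);
    if (end == tok || *end != '\0') return 2;   /* nothing parsed or trailing junk */
    return 0;
}
```

Under these assumptions an input line such as "1.5 d -2.0 ..." yields the flags 0, 1, 0, -1, after which every remaining element of the sequence is marked as a don't care.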

/********************************************************************/

/*

The following routines keep track of the position of the
don't care symbols in the data sequences. Each entry in
a sequence has a corresponding bit in the array d_table.
write_d(s,t) sets the bit corresponding to the element t
of the sequence s to 1, indicating a don't care, while
read_d(s,t) returns the value of the bit corresponding
to the element t in sequence s.

*/

write_d(s, t) int s, t ;

{

int word, bit ;
unsigned pattern = 1 ;

word = t / 16 ;                /* 16 flags are packed in each word */
bit = (t % 16) ;
pattern = pattern << (15-bit) ;
d_table[s][word] = (d_table[s][word]) | pattern ;
return(0) ;

}


read_d(s, t) int s, t ;

{

int word, bit ;
unsigned pattern = 1 ;

word = t / 16 ;
bit = t % 16 ;
pattern = pattern << (15-bit) ;
pattern = pattern & d_table[s][word] ;
if (pattern) return(1) ;
return(0) ;

}
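
The bit packing used by write_d and read_d (16 flags per word, most significant bit first) can be checked in isolation. The following is a minimal sketch in present-day C; the table dimensions NSEQ and MAXT and the use of uint16_t are assumptions made for the example and are not taken from the listing.

```c
#include <assert.h>
#include <stdint.h>

#define NSEQ 4     /* hypothetical number of sequences */
#define MAXT 64    /* hypothetical largest time index */

uint16_t d_table[NSEQ][MAXT / 16 + 1];   /* one don't care bit per element */

/* Mark element t of sequence s as a don't care. */
void write_d(int s, int t)
{
    assert(s >= 0 && s < NSEQ && t >= 0 && t <= MAXT);
    d_table[s][t / 16] |= (uint16_t)(1u << (15 - t % 16));
}

/* Return 1 if element t of sequence s is a don't care, 0 otherwise. */
int read_d(int s, int t)
{
    assert(s >= 0 && s < NSEQ && t >= 0 && t <= MAXT);
    return (d_table[s][t / 16] >> (15 - t % 16)) & 1;
}
```

With this layout element 17 of a sequence falls in word 1, bit 14 (mask 0x4000), so marking it leaves the flags of elements 16 and 18 untouched.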


/****** ERROR ROUTINES ******/


/*

A routine to check the bounds of working arrays. The array
to be checked is determined by the argument i.

*/

check(i, ptr) int i, ptr ;

{

if (ptr < overflow[i]) return(1) ;
switch(i)

{

case 0 : { printf("*** program array overflow ***\n") ;
           exit(0) ; }

case 1 : { printf("*** working stack overflow ***\n") ;
           exit(0) ; }

case 2 : { printf("*** symbol table overflow ***\n") ;
           exit(0) ; }

case 3 : { printf("*** bound table overflow ***\n") ;
           exit(0) ; }

case 4 : { printf("*** sequence store overflow ***\n") ;
           exit(0) ; }

case 5 : { printf("*** MAXT should be less than %d\n", Maxtime) ;
           exit(0) ; }

case 6 : { printf("*** temporary ratio list overflow **\n") ;
           exit(0) ; }

case 7 : { printf("*** real constants storage overflow **\n") ;
           exit(0) ; }

}

}

/*******************************************************/

/*

A procedure to print run time error messages and stop
execution. The message to be printed is determined
by the argument i.

*/

run_error(i) int i ;

{

switch(i)

{

case 1 : { printf("** sequence array out of bound \n") ;
           exit(0) ; }

case 2 : { printf("** too many array arguments \n") ;
           exit(0) ; }

case 3 : { printf("** too few array arguments \n") ;
           exit(0) ; }

case 4 : { printf("** only parameters may be used in sequ. declaration \n") ;
           exit(0) ; }

case 5 : { printf("** expecting a par. or an index in sequ. specification \n") ;
           exit(0) ; }

case 6 : { printf("** FOR variables must be declared as INDEX ") ;
           exit(0) ; }

case 7 : { printf("** wrong number of arguments in Multiplexer list \n") ;
           exit(0) ; }

case 8 : { printf("** Format error in input file **\n") ;
           exit(0) ; }

case 9 : { printf("** insufficient data in input file **\n") ;
           exit(0) ; }

case 10: { printf("** incorrect model or non causal equations **\n") ;
           exit(0) ; }

case 11: { printf("** Inconsistent system of equations \n") ;
           printf("** Attempt to overwrite a sequence \n") ;
           exit(0) ; }

case 12: { printf("** array of sequences not declared \n") ;
           exit(0) ; }

case 13: { printf("** sequence not declared \n") ;
           exit(0) ; }

case 14: { printf("** missing argument list \n") ;
           exit(0) ; }

case 15: { printf("** variable already declared \n") ;
           exit(0) ; }

}

}


/*******  END  OF  PROGRAM  LISTING  ********/ 
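
The subscript handling in the sequence specification routines (cases 40 and 41) accumulates a row-major offset one subscript at a time: the running offset is multiplied by the extent of the current dimension and the zero-based subscript is added. A minimal sketch of that step follows; the function name fold_index is hypothetical, and the bound check uses assert where the listing calls run_error(1).

```c
#include <assert.h>

/* Fold one more subscript into a running row-major offset.
   lb and ub are the declared (inclusive) bounds of this dimension. */
int fold_index(int offset, int idx, int lb, int ub)
{
    assert(lb <= idx && idx <= ub);
    return offset * (ub - lb + 1) + (idx - lb);
}
```

For an array of sequences declared with bounds 1..3 and 0..4, the element (2,3) is reached by folding the subscript 2 into offset 0, which gives 1, and then the subscript 3, which gives 1 * 5 + 3 = 8.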
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